Abstract

This paper presents a new approach to entropy coding: a family of
generalizations of standard numeral systems, which are optimal for encoding
sequences of equiprobable symbols, into asymmetric numeral systems - optimal
for freely chosen probability distributions of symbols. It has some similarities
to Range Coding, but instead of encoding a symbol by choosing a range, we
spread these ranges uniformly over the whole interval. This leads to a simpler
encoder - instead of using two states to define a range, we need only one. The
approach is very universal - it covers everything from extremely precise encoding
(ABS) to extremely fast encoding with the possibility of additionally encrypting
the data (ANS). This encryption uses the key to initialize a random number
generator, which is used to calculate the coding tables. Such preinitialized
encryption has an additional advantage: it is resistant to brute force attacks -
to check a key we have to perform the whole initialization. We also present an
application of the new approach to error correction: after an error, in each step
we have a chosen probability of observing that something was wrong. This way we
can get near Shannon's limit for any noise level, with expected linear correction
time.

1 Introduction 

In practice, two approaches to entropy coding are used nowadays: building a
binary tree (Huffman coding [1]) and arithmetic/range coding ([2], [3]). The first one
approximates probabilities of symbols with powers of 2, so it isn't precise. Arithmetic
coding is precise. It encodes a symbol by choosing one of large ranges of lengths
proportional to the assumed probability distribution (q). Intuitively, by analogy to
standard numeral systems, the symbol is encoded on the most important position.
To define the current range, we need to use two numbers (states).

We will construct a precise encoding that uses only one state. It will be done by
distributing symbols uniformly instead of in ranges - intuitively: placing information
on the least important position. Standard numeral systems are optimal for
encoding streams of equiprobable digits. Asymmetric numeral systems ([4]) are a
natural generalization to other, freely chosen probability distributions. If we
choose the uniform probability distribution, with proper initialization we get a
standard numeral system.

For the binary case, the Asymmetric Binary System (ABS), practical formulas have
been found, which give an extremely precise entropy encoder for which the probability
distribution of symbols can freely change. It was shown ([5]) that it can be a practical
alternative to arithmetic coding.

For the general case, Asymmetric Numeral Systems (ANS), instead of using
formulas we initially use a pseudorandom number generator to distribute symbols
with the assumed statistics. The precision can still be very high, but the disadvantage is
that when the probability distribution changes, we have to reinitialize. The advantage is
that we encode/decode a few bits in one use of the table - we get compression rates
like in arithmetic coding and speed like in Huffman coding. A demonstration is
available at [6].

Another advantage is that we can use a key to initialize the random
number generator, additionally encrypting the data. Such encryption is extremely
unpredictable - it uses random coding tables and a hidden random variable to choose
the behavior and so the current length of a block. This approach is faster than standard
block ciphers and is much more resistant to brute force attacks.

The last section presents a new approach to error correction, which
is able to get near Shannon's limit for any noise level and is still practical - it has
expected linear (or N lg(N)) correction time. It can be imagined as path tracking:
we know the starting and ending positions and we want to walk between them using
the proper path. While we stay on this path everything is fine, but when we lose it, in
each step we have a selected probability of becoming aware of this fact. Then we
can go back and try to make some correction. If this probability is chosen higher
than some threshold corresponding to Shannon's limit, the number of corrections
we should try no longer grows exponentially, and so we can easily verify that it
was the proper correction.

Intuitively we use short blocks, but we connect their redundancy. This connection
practically allows us to 'transfer' surpluses of redundancy to help with large local
error concentrations.

1.0.1 Very brief introduction to entropy coding 

The possibility of choosing one of $2^n$ choices stores n bits of information. Assume
now that we can store information by choosing a sequence of bits of length n, but
such that the probability of '1' is given (p). We can evaluate the number of such
sequences using Stirling's formula $\lim_{n\to\infty} \frac{n!}{\sqrt{2\pi n}\,(n/e)^n} = 1$:

$$\binom{n}{pn} = \frac{n!}{(pn)!\,(\tilde{p}n)!} \approx (2\pi)^{-1/2}\, \frac{n^{n+1/2}\, e^{-n}}{(pn)^{pn+1/2}\, (\tilde{p}n)^{\tilde{p}n+1/2}\, e^{-n}} = (2\pi n p \tilde{p})^{-1/2}\; 2^{-n\,(p \lg p + \tilde{p} \lg \tilde{p})}$$

where $\tilde{p} = 1 - p$. So while encoding in such sequences, we can store on average

$$h(p) := -p \lg(p) - (1-p) \lg(1-p) \quad \text{bits of information/symbol} \qquad (1)$$

That's the well known formula for Shannon's entropy. In practice we usually don't know
the probability distribution, but we approximate it using some statistical
analysis. The nearer it is to the real probability distribution, the better compression
rates we get. The final step is the entropy coder, which uses the found statistics to
encode the message.
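
The counting argument behind formula (1) can be checked numerically. The following is only a minimal sketch: the function names `h` and `bits_per_symbol` are chosen here for illustration, and p = 0.3 is an arbitrary example value. It compares lg of the exact number of sequences, divided by n, with the entropy.

```python
from math import comb, log2

def h(p):
    """Binary Shannon entropy in bits per symbol, formula (1)."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

def bits_per_symbol(n, ones):
    """lg(number of length-n bit sequences with `ones` ones), divided by n."""
    return log2(comb(n, ones)) / n

p = 0.3  # example probability of '1' (an assumption of this sketch)
for n in (10, 100, 1000, 10000):
    # the second column approaches h(p) from below as n grows
    print(n, bits_per_symbol(n, round(p * n)), h(p))
```

The remaining gap is the square root factor of Stirling's formula, which contributes only about lg(n)/(2n) bits per symbol.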

Even if we knew the probability distribution perfectly, the expected compression
rate would usually be a bit larger than Shannon's entropy. One
reason is that the encoded message usually contains some additional correlations. The
second source of such entropy increase is that entropy coders are constructed for
some discrete set of probability distributions, so they have to approximate the
original one.

An event of probability 1/n stores lg(n) bits, so generally an event of probability
q should store lg(1/q) bits. This can be seen in Shannon's formula: it's
the average of stored bits with probabilities of events as weights.

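So if we use a coder which encodes perfectly the $(q_s)$ symbol distribution to encode
a $(p_s)$ symbol sequence, we would get on average $\sum_s p_s \lg(1/q_s)$ bits per symbol. The
difference between this value and the optimal one is called the Kullback-Leibler distance:

$$\Delta H = \sum_s p_s \lg(1/q_s) - \sum_s p_s \lg(1/p_s) = -\sum_s p_s \lg\left(\frac{q_s}{p_s}\right) \approx \frac{-1}{\ln 2} \sum_s p_s \left( \left(\frac{q_s}{p_s} - 1\right) - \frac{1}{2}\left(\frac{q_s}{p_s} - 1\right)^2 \right) = \sum_s \frac{(p_s - q_s)^2}{p_s \ln 4} \qquad (2)$$

We have used the second order Taylor expansion of the logarithm around 1. The first
term vanishes and the second allows us to quickly estimate how important it is that
the entropy coder is precise.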
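
To get a feeling for the size of this loss, the exact Kullback-Leibler distance can be compared with its second order approximation. This is a small sketch only; the two distributions below are made-up example values, not data from the paper.

```python
from math import log, log2

def kl_bits(p, q):
    """Exact Kullback-Leibler distance in bits/symbol."""
    return sum(ps * log2(ps / qs) for ps, qs in zip(p, q))

def kl_second_order(p, q):
    """Second order approximation: sum_s (p_s - q_s)^2 / (p_s ln 4)."""
    return sum((ps - qs) ** 2 / ps for ps, qs in zip(p, q)) / log(4)

p = [0.5, 0.3, 0.2]     # hypothetical true distribution
q = [0.51, 0.29, 0.2]   # hypothetical distribution assumed by the coder
print(kl_bits(p, q), kl_second_order(p, q))
```

For such a small mismatch both values agree closely, so the quadratic term is a convenient estimate of the cost of imprecise probabilities.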



2 General concept 

We would like to encode an uncorrelated sequence of symbols with a known probability
distribution into as short a sequence of bits as possible. For simplicity we will assume
that the probability distribution doesn't change in time, but it can be naturally
generalized to varying distributions. The encoder will receive succeeding symbols
and transform them into succeeding bits.

A symbol (event) of probability p contains lg(1/p) bits of information - it
doesn't have to be a natural number. If we just assign to each symbol a sequence
of bits like in Huffman coding, we approximate probabilities with powers of 2. If
we want to get closer to the optimal compression rates, we have to be more precise
- the encoder has to be more complicated: use not only the current symbol, but
also relate, for example, to the previous ones. The encoder should have some state
in which a non-integer number of bits of information is stored. In the arithmetic
coder this state is two numbers describing the current range.

The state of the presented encoder will be one natural number: $x \in \mathbb{N}$. For this
subsection we will forget about sending bits to output and focus on encoding symbols.
So the state x in a given moment is a large natural number which encodes all already
processed symbols. We could just encode it as a binary number after processing the
whole sequence, but because of its size this is completely impractical. In section 4 it
will be shown that we can transfer the youngest bits of x to ensure that it stays in
some fixed range during the whole process. For now we are looking for a rule for
changing the state while processing a symbol s:

$$(s, x) \;\underset{\text{decoding}}{\overset{\text{encoding}}{\rightleftarrows}}\; x' \qquad (3)$$

So our encoder starts with, for example, x = 0 and uses the above rule on succeeding
symbols. These rules are bijective, so we can uniquely reverse the whole process -
decode the final state back into the initial sequence of symbols, in reversed order.

In a given moment x stores some non-integer number of bits of information.
While writing it in the binary system, we would round this value up. To avoid such
approximations, we will use the convention that x is the possibility of choosing one of
$\{0, 1, .., x-1\}$ numbers, so x contains exactly lg(x) bits of information.

For an assumed probability distribution of n symbols, we will somehow split the
set $\{0, 1, .., x-1\}$ into n separate subsets of sizes $x_0, .., x_{n-1} \in \mathbb{N}$, such that
$\sum_{s=0}^{n-1} x_s = x$. We can treat the possibility of choosing one of x numbers as the
possibility of choosing the number of a subset (s) and then choosing one of $x_s$ numbers.
So with probability $q_s = x_s/x$ we would choose the s-th subset. We can enumerate elements
of the s-th subset from 0 to $x_s - 1$ in the same order as in the original enumeration of
$\{0, 1, .., x-1\}$.

Summarizing: we've exchanged the possibility of choosing one of x numbers
(lg(x) bits) for the possibility of choosing a pair: a symbol s ($\lg(1/q_s)$ bits) with
known probability ($q_s$) and the possibility of choosing one of $x_s$ numbers ($\lg(x_s) =
\lg(x) - \lg(1/q_s)$ bits). This ($x \leftrightarrow (s, x_s)$) will be the bijective coding we are looking
for.

We will now describe how to split the range. In the arithmetic coding approach
(Range Coding), we would divide $\{0, .., x-1\}$ into ranges. In ANS we will distribute
these subsets uniformly.

We can describe this split using a distributing function $D_1 : \mathbb{N} \to \{0, .., n-1\}$:

$$\{0, .., x-1\} = \bigcup_{s=0}^{n-1} \{y \in \{0, .., x-1\} : D_1(y) = s\}$$

We can now enumerate numbers in these subsets by counting how many elements
from the same subset there were before:

$$x_s := \#\{y \in \{0, 1, .., x-1\} : D_1(y) = s\} \qquad D_2(x) := x_{D_1(x)} \qquad (4)$$

getting the bijective decoding function (D) and its inverse, the coding function (C):

$$D(x) := (D_1(x), D_2(x)) = (s, x_s) \qquad C(s, x_s) := x.$$

Assume that our sequence consists of $n \in \mathbb{N}$ symbols with a given probability
distribution $(q_s)_{s=0,..,n-1}$ ($\forall_{s=0,..,n-1}\; q_s > 0$). We have to construct a distributing
function and coding/decoding functions for this distribution such that

$$\forall_{s,x} \quad x_s \text{ is approximately } x \cdot q_s \qquad (5)$$

We will now show informally how essential the above condition is. Sections 3 and 5
show two ways of making such a construction.

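Statistically, a symbol encodes $H(q) := -\sum_s q_s \lg q_s$ bits.
ANS uses $\lg(x) - \lg(x_s) = \lg(x/x_s)$ bits of information to encode a symbol s from
the state $x_s$. Analogously to (2), using the second order Taylor expansion of the
logarithm (around $q_s$), we can estimate that our encoder needs on average:

$$-\sum_s q_s \lg\left(\frac{x_s}{x}\right) \approx -\sum_s q_s \left( \lg(q_s) + \frac{x_s/x - q_s}{q_s \ln(2)} - \frac{(x_s/x - q_s)^2}{2 q_s^2 \ln(2)} \right) = H(q) + \frac{\sum_s q_s - \sum_s x_s/x}{\ln(2)} + \sum_s \frac{(x_s/x - q_s)^2}{q_s \ln(4)} \quad \text{bits/symbol.} \qquad (6)$$

The middle term vanishes, because both sums are equal to 1. We could average the
last term over all possible $x_s$ to estimate how many bits/symbol we are wasting.
We will do it in section 6.

3 Asymmetric Binary System (ABS)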

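It turns out that in the binary case we can find simple explicit formulas for the
coding/decoding functions.

We now have two symbols: "0" and "1". Denote $q := q_1$, so $\tilde{q} := 1 - q = q_0$.
To get $x_s \approx x \cdot q_s$, we can for example take

$$x_1 := \lceil xq \rceil \quad (\text{or alternatively } x_1 := \lfloor xq \rfloor) \qquad (7)$$

$$x_0 = x - x_1 = x - \lceil xq \rceil \quad (\text{or } x_0 = x - \lfloor xq \rfloor) \qquad (8)$$

Now using (4): $D_1(x) = 1 \Leftrightarrow$ there is a jump of $\lceil xq \rceil$ after it:

$$s := \lceil (x+1)q \rceil - \lceil xq \rceil \quad (\text{or } s := \lfloor (x+1)q \rfloor - \lfloor xq \rfloor) \qquad (9)$$

We've just defined the decoding function: $D(x) = (s, x_s)$.

For example for q = 0.3: [table of D(x) for small x]
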
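We will now find the coding function: we have s and $x_s$ and want to find x.
Denote $r := \lceil xq \rceil - xq \in [0, 1)$

$$s := \lceil (x+1)q \rceil - \lceil xq \rceil = \lceil (x+1)q - \lceil xq \rceil \rceil = \lceil (x+1)q - r - xq \rceil = \lceil q - r \rceil$$

$$s = 1 \Leftrightarrow r < q$$

s = 1: $x_1 = \lceil xq \rceil = xq + r$

$$x = \frac{x_1 - r}{q} = \left\lfloor \frac{x_1}{q} \right\rfloor$$

because it's a natural number and $0 \le r < q$.

s = 0: $q \le r < 1$, so $\tilde{q} \ge 1 - r > 0$
$x_0 = x - \lceil xq \rceil = x - xq - r = x\tilde{q} - r$

$$x = \frac{x_0 + r}{\tilde{q}} = \frac{x_0 + 1}{\tilde{q}} - \frac{1 - r}{\tilde{q}} = \left\lceil \frac{x_0 + 1}{\tilde{q}} \right\rceil - 1$$

Finally coding:

$$C(s, x) = \begin{cases} \left\lceil \frac{x+1}{1-q} \right\rceil - 1 & \text{if } s = 0 \\ \left\lfloor \frac{x}{q} \right\rfloor & \text{if } s = 1 \end{cases} \quad \text{or} \quad = \begin{cases} \left\lfloor \frac{x}{1-q} \right\rfloor & \text{if } s = 0 \\ \left\lceil \frac{x+1}{q} \right\rceil - 1 & \text{if } s = 1 \end{cases} \qquad (10)$$

For q = 1/2 it's the usual binary system (with switched digits).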
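
The ABS formulas can be checked with a few lines of code. This is a minimal sketch of the 'ceiling' variant only: the names `abs_decode` and `abs_encode` are chosen here for illustration, and exact rational arithmetic (`fractions.Fraction`) is an assumption of the sketch, used to avoid floating point rounding for values like q = 0.3.

```python
from fractions import Fraction
from math import ceil, floor

def abs_decode(x, q):
    """D(x) = (s, x_s): s from the jump of ceil(xq), x_1 = ceil(xq), x_0 = x - x_1."""
    x1 = ceil(x * q)
    s = ceil((x + 1) * q) - x1
    return (1, x1) if s == 1 else (0, x - x1)

def abs_encode(s, xs, q):
    """C(s, x_s): floor(x_s / q) for s = 1, ceil((x_s + 1) / (1 - q)) - 1 for s = 0."""
    if s == 1:
        return floor(xs / q)
    return ceil((xs + 1) / (1 - q)) - 1

q = Fraction(3, 10)
for x in range(10):
    s, xs = abs_decode(x, q)
    # columns: x, symbol s, x_s, and C(s, x_s), which reproduces x
    print(x, s, xs, abs_encode(s, xs, q))
```

Running the loop over a longer range is a quick way to confirm that C and D are inverses of each other for a chosen q.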

4 Stream coding/decoding

We can now encode into large natural numbers (x). We would like to use
ABS/ANS to encode a data stream - into a potentially infinite sequence of digits (bits)
with expected uniform distribution. To do it we can sometimes transfer a part of
the information from x into a digit of a standard numeral system, to force x to
stay in some fixed range (I).

4.1 Algorithm

Let us choose that the data stream will be encoded as $\{0, .., b-1\}$ digits - in the
standard numeral system of base $b \ge 2$. In practice we should mainly use the
binary system (b = 2), but thanks to this general approach we can for example use
$b = 2^8$ to transfer a whole byte at once. Symbols contain correspondingly $\lg(1/q_s)$
bits of information. When they accumulate to lg(b) bits, we will transfer a full digit
to/from output, moving x back to I (bit transfer).

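Observe that taking an interval of the form ($l \in \mathbb{N}$):

$$I := \{l, l+1, .., bl-1\} \qquad (11)$$

for any $x \in \mathbb{N}$ we have exactly one of three cases:

• $x \in I$, or

• $x > bl - 1$, then $\exists!_{k \in \mathbb{N}}\; \lfloor x/b^k \rfloor \in I$, or

• $x < l$, then $\forall_{(d_i) \in \{0,..,b-1\}^{\mathbb{N}}}\; \exists!_{k \in \mathbb{N}}\; x b^k + d_1 b^{k-1} + .. + d_k \in I$.

We will call such intervals b-unique: starting from any natural number x, after
possibly a few reductions ($x \to \lfloor x/b \rfloor$) or placing a few youngest digits in x
($x \to xb + d_t$) we would finally get into I in a unique way.

For some interval (I), define

$$I_s = \{x : C(s, x) \in I\}, \quad \text{so that} \quad I = \bigcup_s C(s, I_s). \qquad (12)$$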

Define:

Stream decoding:
{ (s,x) = D(x);
  use s;                 (e.g. to generate symbol)
  while (x ∉ I)
    x = xb + 'digit from input'
}

Stream coding(s):
{ while (x ∉ I_s)
    { put mod(x,b) to output; x = ⌊x/b⌋ }
  x = C(s,x)
}

Figure 1: Stream coding/decoding



We need the above functions to be unambiguous inverses of each other.
Observe that this holds iff $I_s$ for s = 0, .., n-1 and I are b-unique:

$$I = \{l, .., lb - 1\} \qquad I_s = \{l_s, .., l_s b - 1\} \qquad (13)$$

for some $l, l_s \in \mathbb{N}$.

We have: $\sum_s l_s (b-1) = \sum_s \# I_s = \# I = l(b-1)$.
Remembering that $C(s, x) \approx x/q_s$, we finally have:

$$l_s \approx l q_s \qquad \sum_s l_s = l. \qquad (14)$$
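
The stream coding and decoding loops of Figure 1 can be sketched for the binary case (b = 2) with the ABS formulas. The sketch below is only an illustration under stated assumptions: q = l_1/l exactly, so that both I_s are b-unique, integer arithmetic replaces the fractions, and the names `make_abs`, `encode` and `decode` as well as the parameters 307 and 1024 are chosen here, not taken from the paper. The decoder returns the symbols in the original order by reversing its output.

```python
import random

def make_abs(l1, l):
    """ABS coding/decoding functions for q = l1/l, using integers only."""
    l0 = l - l1
    def C(s, x):
        if s == 1:
            return (x * l) // l1                 # floor(x / q)
        return -((-(x + 1) * l) // l0) - 1        # ceil((x+1)/(1-q)) - 1
    def D(x):
        x1 = -((-x * l1) // l)                   # ceil(x q)
        s = -((-(x + 1) * l1) // l) - x1         # ceil((x+1) q) - ceil(x q)
        return (1, x1) if s == 1 else (0, x - x1)
    return C, D

def encode(symbols, l1, l):
    """Stream coding: returns the final state and the list of output bits."""
    C, _ = make_abs(l1, l)
    ls = (l - l1, l1)              # I_s = {l_s, .., 2 l_s - 1}
    x, bits = l, []                # initial state: the smallest element of I
    for s in symbols:
        while x >= 2 * ls[s]:      # x not in I_s: move the youngest bit out
            bits.append(x & 1)
            x >>= 1
        x = C(s, x)
    return x, bits

def decode(x, bits, count, l1, l):
    """Stream decoding: reads bits from the end, returns symbols in order."""
    _, D = make_abs(l1, l)
    bits, out = list(bits), []
    for _ in range(count):
        s, x = D(x)
        out.append(s)
        while x < l:               # x not in I: take the bits back
            x = 2 * x + bits.pop()
    return out[::-1], x, bits

random.seed(1)
msg = [1 if random.random() < 0.3 else 0 for _ in range(2000)]
x, bits = encode(msg, 307, 1024)
decoded, x0, rest = decode(x, bits, len(msg), 307, 1024)
print(len(bits), decoded == msg, x0, rest)
```

After decoding, the state returns to the initial value l and no bits are left, which is a convenient consistency check.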



We will now look at the behavior of lg x while stream coding s:

$$\lg x \to \approx \lg x + \lg(1/q_s) \pmod{\lg(b)} \qquad (15)$$

We have three possible sources of random behavior of x:

• we choose one of the symbols (behaviors) in a statistical (random) way,

• $\log_b(q_s)$ are usually irrational,

• $C(s, x)$ is near but not exactly $x/q_s$.

This suggests that lg x should cover the possible space uniformly, which agrees with
statistical simulations. That means that the probability of visiting a given state x
should be approximately proportional to 1/x. We will focus on it in section 6.

4.2 Analysis of a single step

Let's concentrate on a single stream coding step. Choose some $s \in \{0, .., n-1\}$.
Among the l(b-1) states of $I = \{l, .., lb-1\}$ we have $l_s(b-1)$ appearances of symbol s.

While choosing $l_s$ we are approximating probabilities. So to simplify further
analysis, let us assume for the rest of the paper:

$$q_s = \frac{l_s}{l} \qquad (16)$$

Let us introduce notation for the fractional and integer parts:

$$c_s := \{\log_b(q_s)\} \in [0, 1) \qquad k_s := -\lfloor \log_b(q_s) \rfloor \in \mathbb{N}^+ \qquad (17)$$

$$\log_b(1/q_s) = k_s - c_s \qquad 1 \le q_s b^{k_s} = b^{c_s} < b \qquad (18)$$

where $\{z\} = z - \lfloor z \rfloor$ is the fractional part.
Now if we introduce a new variable:

$$y := \log_b(x) \qquad (19)$$

one coding step is approximately $y \to \{y - c_s\}$.




Figure 2: Example of stream coding/decoding step for b = 2, $k_s$ = 3, $l_s$ = 13,
l = 9·4 + 3·8 + 6 = 66, $q_s$ = 13/66, x = 19, $b^{k_s-1} x + 3$ = 79 = 66 + 2 + 2·4 + 3.

The situation looks like in fig. 2:

• The bit transfer makes the states denoted by circles behave just like
the state denoted by the ellipse on their left. The difference between them is in
the transferred digits.

• The number of transferred digits has at most two possible values differing
by 1: $k_s - 1$ and $k_s$.

$k_s$ is the only number such that $\lfloor (lb - 1)/b^{k_s} \rfloor \in I_s$.

When $q_s$ is near some integer power of b ($q_s \approx b^{-k_s}$), we can have a situation
in which we always transfer $k_s$ digits, but it can be treated as a special case of the
first one ($X_s = l$).

• The states denoted by ellipses are multiples of correspondingly $b^{k_s-1}$ or
$b^{k_s}$. So if l is not a natural power of b, there can be some states before the
first multiple of $b^{k_s-1}$. They correspond to the last multiple of $b^{k_s}$.

Let's assume for simplicity that

$$L := \log_b(l) \in \mathbb{N} \qquad (20)$$

so the first state in the picture is an ellipse.

With this assumption we can have the special case from the previous point, in which
always $k_s$ digits are transferred, if and only if $q_s = b^{-k_s}$.

We also assume that we have some appearances of each symbol, so $L \ge k_s$.

• The states before the step (the top of the picture) can be divided into two 
ranges - on the left or right of some boundary value 

Xg := max{x : C{s, [x/b''"''^ \) < lb} = mm{D2{x) : Di{x) = s, x > 1} 

On the left of this value we transfer $k_s - 1$ digits (this range can be degenerate), on the
right we transfer $k_s$ digits.

From the $l_s(b-1)$ ellipses, $\frac{X_s - l}{b^{k_s-1}}$ are on the left and $\frac{lb - X_s}{b^{k_s}}$ are on the right of $X_s$:

$$l_s (b-1)\, b^{k_s} = (X_s - l)\, b + lb - X_s = (b-1) X_s$$

We get the exact formula:

$$l \le X_s = l_s b^{k_s} = l q_s b^{k_s} = l b^{c_s} < lb \qquad (21)$$

Intuitively the position of this boundary corresponds to the inequality
$b^{-k_s} \le q_s < b^{-k_s+1}$.

• While the situation on the top of the picture (before the coding step) was fully
determined by l and $q_s$, the distribution on the bottom (after) has full freedom:
it is made by choosing the distributing function.

This time we have the boundary value: (^([pjzrl) ~ 
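
Finally the change of state after one step of stream coding, $C_s : I \to I$, is:

$$C_s(x) = \begin{cases} C(s, \lfloor x/b^{k_s-1} \rfloor) & \text{for } x < X_s \\ C(s, \lfloor x/b^{k_s} \rfloor) & \text{for } x \ge X_s \end{cases} \; = \; C(s, \lfloor x/b^{k_s - [x < X_s]} \rfloor) \qquad (22)$$

where we use the notation

$$[x < X_s] := \begin{cases} 1 & \text{for } x < X_s \\ 0 & \text{for } x \ge X_s \end{cases}$$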



5 Asymmetric Numeral Systems (ANS) 

In the general case, encoding a sequence of symbols with probability distribution
$0 < q_0, .., q_{n-1} < 1$ for some n > 2, we could divide the selection of a symbol
into a few binary choices and just use ABS. In this section we will see that we
can also encode such symbols directly. Unfortunately I couldn't find
practical explicit formulas for n > 2, but we can calculate the coding/decoding
functions during initialization, making processing of the data stream extremely
fast. The problem is that we cannot really tabulate all possible probability
distributions - we have to initialize for a few of them and possibly reinitialize sometimes.

This time we fix the range we are working on ($I = \{l, .., bl-1\}$), so in fact
we are interested in the stream coding/decoding functions only on this set. They are
determined by the distribution of symbols: $(b-1) l_s$ appearances of symbol s. This way
we are approximating the probabilities. As was already said, we will assume
$q_s = l_s/l$. The exact probability will from now on be denoted $q'_s$, so $|q_s - q'_s|$ is of order $1/l$.

5.1 Precise coder 

We will now construct a precise coder in a similar way as for the binary case.

Denote $N_s := \{ i/q_s : i \in \mathbb{N}^+ \}$.

They look to be a good approximation of the positions of symbols in the distributing
function. We only have to move them to positions at natural numbers.
Intuition suggests that to choose symbols, we should successively take the smallest
element which hasn't been chosen yet from these sets.

Observe that $\#(N_s \cap [0, x]) = \lfloor x q_s \rfloor$, but $\sum_s \lfloor x q_s \rfloor \le x$. So if we use the just
proposed algorithm, while choosing a symbol for a given x, at least $\lfloor x q_s \rfloor$ appearances
of each symbol have already occurred: $x_s \ge \lfloor x q_s \rfloor$.

For x being a natural multiple of l we get equalities instead. To generally
bound $x_s$ from above, observe that because the fractional parts of $x q_s$ sum to a
natural number, we have $\sum_s \lfloor x q_s \rfloor \ge x - n + 1$.

Finally, because $\sum_s x_s = x$, we get:

$$\lfloor x q_s \rfloor \le x_s \le \lfloor x q_s \rfloor + n - 1 \;\Rightarrow\; x_s - x q_s \in (-1, n-1] \qquad (23)$$

Numerical simulations suggest that there are probability distributions for which
we cannot improve this pessimistic evaluation, but in practice $|x_s - x q_s|$ is usually
smaller than 1.

To implement this algorithm, in each step we have to find the smallest of n
numbers. Assume we have implemented some priority queue, for example using
a heap. Besides initialization it has two instructions: put((y, s)) inserts the (y, s)
pair into the queue, getmin removes and returns the pair which is the smallest in the
$(y, s) \le (y', s') \Leftrightarrow y \le y'$ relation.

Precise initialization:

For s = 0 to n-1 do { put((1/q_s, s)); x_s = l_s };
For x = l to bl-1 do
  { (y, s) = getmin; put((y + 1/q_s, s));
    D[x] = (s, x_s) or C[s, x_s] = x;
    x_s++ }
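
The precise initialization can be sketched with a standard binary heap. This is an illustration only: the function name `precise_init`, the dictionary representation of the D and C tables, and the example values (3, 5, 8) for the l_s are assumptions of this sketch. Exact fractions are used for 1/q_s = l/l_s so that ties are resolved deterministically.

```python
import heapq
from fractions import Fraction

def precise_init(ls, b=2):
    """Build D[x] = (s, x_s) and C[(s, x_s)] = x on I = {l, .., b*l - 1}.

    ls[s] is l_s (assumed positive), l = sum(ls) and q_s = ls[s] / l.
    """
    l = sum(ls)
    heap = [(Fraction(l, n), s) for s, n in enumerate(ls)]   # (1/q_s, s)
    heapq.heapify(heap)
    xs = list(ls)                      # x_s starts at l_s
    D, C = {}, {}
    for x in range(l, b * l):
        y, s = heapq.heappop(heap)     # smallest not yet used element
        heapq.heappush(heap, (y + Fraction(l, ls[s]), s))
        D[x] = (s, xs[s])
        C[(s, xs[s])] = x
        xs[s] += 1
    return D, C

D, C = precise_init([3, 5, 8])
for x in sorted(D):
    print(x, D[x])
```

Each symbol s then appears exactly (b-1) l_s times in the table, with x_s running through I_s = {l_s, .., b l_s - 1}, as required for the stream version.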

5.2 Selfcorrecting diffusion (ScD) 

We will now focus on a slightly less precise, but faster statistical initialization method:
fill a table of size (b-1)l with the proper number of appearances of symbols and for
succeeding x take a symbol from a random position in this table, reducing the table.
So at the beginning it will behave like diffusion, but it will correct itself while
approaching the end.

Another advantage of this approach is that after fixing $(l_s)$, we still have a huge
(exponential in #I) number of possible coding functions - we can choose one using
some key, additionally encrypting the data.

Initialization:

m = (b-1)l; symbols = (0, .., 0, 1, .., 1, .., n-1, .., n-1);   (symbol s repeated (b-1) l_s times)
For s = 0 to n-1 do x_s = l_s;
For x = l to bl-1 do
  { i = random natural number from 1 to m;
    s = symbols[i]; symbols[i] = symbols[m]; m--;
    D[x] = (s, x_s) or C[s, x_s] = x;
    x_s++ }

Here we can use practically any deterministic pseudorandom number generator,
like Mersenne Twister ([7]), and use a possible key for its initialization.
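
The ScD initialization and the table-driven stream coder can be sketched together. This is a hedged illustration: the names `scd_init`, `encode`, `decode`, the dictionary tables and the example parameters are assumptions of this sketch. Python's `random.Random` is a Mersenne Twister generator seeded here with the key; it is not designed for cryptographic use, so the sketch demonstrates only the mechanism.

```python
import random

def scd_init(ls, key, b=2):
    """Selfcorrecting diffusion: random symbol spread on I = {l, .., b*l - 1}."""
    l = sum(ls)
    rng = random.Random(key)                 # generator initialized with the key
    symbols = [s for s, n in enumerate(ls) for _ in range((b - 1) * n)]
    xs = list(ls)                            # x_s starts at l_s
    D, C = {}, {}
    for x in range(l, b * l):
        i = rng.randrange(len(symbols))      # random remaining position
        s = symbols[i]
        symbols[i] = symbols[-1]             # fill the hole with the last entry
        symbols.pop()
        D[x] = (s, xs[s])
        C[(s, xs[s])] = x
        xs[s] += 1
    return D, C

def encode(msg, C, ls, b=2):
    """Stream coding with tables; returns the final state and output digits."""
    x, digits = sum(ls), []
    for s in msg:
        while x >= b * ls[s]:                # x not in I_s
            digits.append(x % b)
            x //= b
        x = C[(s, x)]
    return x, digits

def decode(x, digits, count, D, l, b=2):
    """Stream decoding; digits are consumed from the end."""
    digits, out = list(digits), []
    for _ in range(count):
        s, x = D[x]
        out.append(s)
        while x < l:                         # x not in I
            x = b * x + digits.pop()
    return out[::-1], x

ls = [3, 5, 8, 16]                           # example l_s values, l = 32
D, C = scd_init(ls, key=12345)
msg = random.Random(0).choices(range(4), weights=ls, k=500)
x, digits = encode(msg, C, ls)
decoded, x0 = decode(x, digits, len(msg), D, sum(ls))
print(len(digits), decoded == msg, x0)
```

Decoding with tables built from a different key generally produces a different symbol sequence, which is the property the encryption idea relies on.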

It will be precise at the beginning and the end, but generally the imprecision will
be larger. We will analyze it now. While selecting some symbol s, we can divide
symbols into two groups: this symbol and the rest. So we can restrict ourselves to a
simplified model:

Model: We have N distinguishable numbers: L copies of '1' and N - L copies
of '0'. What is the probability that if we choose M of them, there will be K copies of '1'?

K copies in M symbols can be distributed in $\binom{M}{K}$ ways. After choosing one, its
copies of '1' are distributed in L(L-1)..(L-K+1) ways, and of '0' in
(N-L)..(N-L-(M-K)+1) ways. The number of all such sequences is N(N-1)..(N-M+1),
so the probability we are looking for is:

$$P_{N,L,M}(K) = \binom{M}{K} \frac{L!}{(L-K)!} \frac{(N-L)!}{(N-L-M+K)!} \frac{(N-M)!}{N!} = \binom{M}{K} \binom{N-M}{L-K} \Big/ \binom{N}{L}$$

For this derivation denote the expected value $q := L/N$.

This probability distribution should be Gaussian-like with its maximum at $K/M \approx q$.
To approximate its width, we can use the approximation of the binomial coefficient from the
introduction:

$$\log_2(P_{N,L,M}(K)) \approx M\, h\left(\frac{K}{M}\right) + (N-M)\, h\left(\frac{L-K}{N-M}\right) - N\, h(q)$$

Because we are interested only in some approximation of the width of the Gaussian,
we have omitted the terms with square roots - they correspond mainly to probability
normalization. This formula has its only maximum at K = Mq, as expected. Expanding
around this point up to the second Taylor term, we get (with $\tilde{q} = 1 - q$)

$$P_{N,L,M}(K) \approx \exp\left( -\frac{M}{2 q \tilde{q}\, (1 - M/N)} \left(\frac{K}{M} - q\right)^2 \right) \qquad (24)$$

We get the standard deviation $\sigma = \sqrt{M q \tilde{q}\, (1 - M/N)}$.

This result agrees well with exact numerical calculations. Observe that without
the (1 - M/N) term, it would be just the formula from the central limit theorem for the
binomial distribution (P('1') = q).

So as expected: for small M we have diffusion-like behavior, but this term
makes us approach the expected value as $M \to N$.

Returning to the algorithm, $N = (b-1)l$, $M = x - l$, $L = (b-1) l_s$, $K = x_s - l_s$:

$$x_s - l_s \approx (x - l)\, q_s \pm \sqrt{(x - l)\, q_s \tilde{q}_s \left(1 - \frac{x - l}{(b-1)\, l}\right)} = (x - l)\, q_s \pm \sqrt{q_s \tilde{q}_s\, \frac{(x - l)(bl - x)}{(b-1)\, l}} \qquad (25)$$

The standard deviation is the square root of a parabola with zeros at l and bl, as expected.
It reaches its maximum, $\frac{1}{2}\sqrt{q_s \tilde{q}_s (b-1)\, l}$, for x = l(b+1)/2. This is the largest
expected imprecision - it grows with the square root of l.
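
The width estimate can be compared with the exact distribution of the model. The sketch below is an illustration under stated assumptions: the function names and the example values N = 1000, L = 300, M = 400 are chosen here. It computes the exact standard deviation of K from the combinatorial formula and compares it with $\sqrt{M q \tilde{q} (1 - M/N)}$.

```python
from math import comb, sqrt

def pmf(N, L, M, K):
    """Exact probability of K copies of '1' among M chosen out of N (L ones)."""
    return comb(M, K) * comb(N - M, L - K) / comb(N, L)

def std_exact(N, L, M):
    """Standard deviation of K computed directly from the exact distribution."""
    lo, hi = max(0, M - (N - L)), min(M, L)
    mean = sum(K * pmf(N, L, M, K) for K in range(lo, hi + 1))
    var = sum((K - mean) ** 2 * pmf(N, L, M, K) for K in range(lo, hi + 1))
    return sqrt(var)

def std_formula(N, L, M):
    """Approximation sqrt(M q (1 - q) (1 - M/N)) with q = L/N."""
    q = L / N
    return sqrt(M * q * (1 - q) * (1 - M / N))

N, L, M = 1000, 300, 400     # example values only
print(std_exact(N, L, M), std_formula(N, L, M))
```

In the notation of the algorithm one would take N = (b-1)l, L = (b-1)l_s and M = x - l to see how the spread first grows and then shrinks as x approaches bl.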

Modern pseudorandom number generators can be practically unpredictable, so
the ANS initialization would be too. It chooses for each $x \in I$ a different random local
behavior, making the state a practically unpredictable hidden random variable.

Encryption based on ANS, instead of making calculations while taking succeeding
blocks as standard ciphers do, makes all calculations during initialization - processing of
the data is much faster: it just uses the tables. Another advantage of such a preinitialized
cryptosystem is that it's more resistant to brute force attacks - when taking a
new key to try, we cannot just start decoding as usual, but we have to perform the whole
initialization first, which can take as much time as the user wanted. We will focus
on such cryptosystems in section 8.



6 Statistical analysis 

In this section we will try to understand the behavior and calculate some properties of
the presented coders. By construction they behave more or less randomly
and they process more or less random data, so we can usually make only some
rough evaluations, which turn out to agree well with numerical simulations.

For a given coder, let us define a function which measures its imprecision:

$$\epsilon_s(x) = C(s, x) - x/q_s \qquad (26)$$

For precise coders usually $|\epsilon_s(x)| < 1$; for ScD it can be estimated by (25). We have
to connect it with the stream version (22): introduce $\tilde{\epsilon}_s(x)$, such that

$$C_s(x) := C(s, \lfloor x/b^{k_s - [x<X_s]} \rfloor) = \frac{x}{q_s b^{k_s - [x<X_s]}} + \tilde{\epsilon}_s(x) \qquad (27)$$

$$\tilde{\epsilon}_s(x) = \epsilon_s\left(\left\lfloor \frac{x}{b^{k_s - [x<X_s]}} \right\rfloor\right) - \frac{1}{q_s}\left\{ \frac{x}{b^{k_s - [x<X_s]}} \right\} \qquad (28)$$



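These equations suggest changing the variable as previously:

$$y := \log_b(x) - \log_b(l) \in [0, 1], \qquad \bar{I} := \log_b(I) - \log_b(l) \subset [0, 1], \qquad x = l b^y \qquad (29)$$

Now our stream coding function will be $\bar{C}_s : \bar{I} \to \bar{I}$ with $Y_s := \log_b(X_s) - \log_b(l)$.
Observe that this approximated equation can be thought of as $\bar{C}_s(y) \approx \{y - c_s\}$.

Introduce $\bar{\epsilon}_s(y)$ analogously as before:

$$\bar{C}_s(y) = y - c_s + [y < Y_s] + \bar{\epsilon}_s(y) \qquad (30)$$

Let us connect $\bar{\epsilon}_s(y)$ with $\epsilon_s(x)$ and $\tilde{\epsilon}_s(x)$. For $y \ge Y_s$:

$$\bar{\epsilon}_s(y) = \bar{C}_s(y) - y + c_s = \log_b(C_s(l b^y)) - \log_b l - y + c_s =$$

$$= \log_b\left( \frac{l b^y}{q_s b^{k_s}} + \tilde{\epsilon}_s(l b^y) \right) - \log_b l - y + c_s =$$

$$= \log_b\left( \frac{l b^y}{q_s b^{k_s}} \right) + \log_b\left( 1 + \frac{q_s b^{k_s}}{l b^y}\, \tilde{\epsilon}_s(l b^y) \right) - \log_b l - y + c_s \approx \frac{q_s b^{k_s}}{l b^y \ln(b)}\, \tilde{\epsilon}_s(l b^y)$$

where we have used the first order Taylor expansion of the logarithm.

Making a similar calculation for the $y < Y_s$ case, we finally get ($l b^y = x$):

$$\bar{\epsilon}_s(y) \approx \frac{q_s b^{k_s - [x < X_s]}}{x \ln(b)}\, \tilde{\epsilon}_s(x) \qquad (31)$$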

6.1 Probability distribution of the states 

We can now consider the probability distribution among states which our stream
coder/decoder should asymptotically reach while processing a long stream of
symbols/digits.

While processing some data, the state changes in a very complicated and
random-looking way. Let us recall its three sources:

• Asymmetry (the strongest) - different symbols usually have different probabilities
and so change the state in completely different ways. This choice of
symbol/behavior depends on the local symbol distribution, which also looks random.
Analogously while decoding: starting from different states, the transferred
bits denote completely different behavior,

• Uniform covering - usually $c_s = \{\log_b(q_s)\}$ are irrational, so by making
$y \to \{y - c_s\}$ steps, intuitively we should cover the [0, 1) range uniformly,

• Diffusion - $C(s, x)$ is near, but not exactly $x/q_s$ ($\epsilon \ne 0$), so we have some additional,
random-looking motion around the expected state from the two previous
points.

These points strongly suggest that the state practically behaves as a random variable.
So, for example, starting from any state we should be able to reach any other.
Unfortunately some pathological examples can be found, in which all $\log_b(q_s)$
are rational numbers and we use precise initialization, so that we stay in some proper
subset of I:

$$I = \{4, 5, 6, 7\},\quad n = 2,\quad l_0 = l_1 = 2,\quad C_0(4) = C_0(5) = 4,\quad C_1(4) = C_1(5) = 5$$

I couldn't find qualitatively more complicated examples, but if this accidentally happens,
the coder will still work as an entropy coder, only with a slightly different expected
probability distribution of symbols - a worse compression rate.

We can make a natural assumption (*) for the rest of the paper:

For each two states $x, x' \in I$, there is a sequence of symbols $(s_1, .., s_m)$
which takes us from x to x': $C_{s_m}(...(C_{s_1}(x))) = x'$.

Assume now that we want to use the coder with a sequence of symbols with a
given probability distribution $(p_s)_{s=0,..,n-1}$ such that $1 > p_s > 0$. So if in a given
moment the coder is in state x, after one step with probability $p_s$ it will be in the $C_s(x)$
state. It can be imagined as a Markov process. Now the assumption (*) means that
its stochastic matrix is irreducible - from the Frobenius-Perron theorem we know that
there is a unique stationary (limit) probability distribution among states:

$$P : I \to (0, 1), \quad \sum_{x \in I} P(x) = 1 : \quad \forall_{x \in I}\;\; P(x) = \sum_{s,y} \{P(y)\, p_s : C_s(y) = x\} \qquad (32)$$

X 

To obtain a good understanding of the coding process, we should find a good
general approximation of this probability distribution. The details of such a process
are extremely complicated, so to cope with this problem we should find equations
as simple as possible - use the logarithmic form $y = \log_b(x/l)$.

The set of states in this variable, $\bar{I} := \log_b(I) - \log_b(l)$, is a difficult to handle
subset of [0, 1], so to work with probability on this set we should use the probability
distribution function: a nondecreasing function $\mathcal{P} : [0, 1] \to [0, 1]$, fulfilling
$\mathcal{P}(0) = 0$, $\mathcal{P}(1) = 1$:

$$\mathcal{P}(y) := \text{probability of being in a state less than or equal to } y = \sum_{x=l}^{\lfloor l b^y \rfloor} P(x) \qquad (33)$$



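It describes the stationary distribution of the coding process iff (writing $\bar{C}_s$ and
$\bar{\epsilon}_s$ for the stream coding function and its imprecision in the variable y):

$$\mathcal{P}(y) = \sum_s p_s \begin{cases} \mathcal{P}(\bar{C}_s(y)) - \mathcal{P}(\bar{C}_s(0)) & \text{for } y < Y_s \\ (\mathcal{P}(1) - \mathcal{P}(\bar{C}_s(0))) + (\mathcal{P}(\bar{C}_s(y)) - \mathcal{P}(0)) & \text{for } y \ge Y_s \end{cases}$$

$$\mathcal{P}(y) = \sum_s p_s \begin{cases} \mathcal{P}(y - c_s + 1 + \bar{\epsilon}_s(y)) - \mathcal{P}(0 - c_s + 1 + \bar{\epsilon}_s(0)) & \text{for } y < Y_s \\ \mathcal{P}(y - c_s + \bar{\epsilon}_s(y)) - \mathcal{P}(0 - c_s + 1 + \bar{\epsilon}_s(0)) + 1 & \text{for } y \ge Y_s \end{cases}$$

We see that for $\epsilon = 0$ the unique solution is $\mathcal{P}(y) = y$ for $y \in [0, 1]$. It's an idealized
solution - in practice we have some discrete set of states, so $\mathcal{P}$ cannot even be
continuous. $\epsilon$ is some very small, randomly behaving function having different signs,
and it is somehow averaged in the above equations, so intuitively $\mathcal{P}$ should be near this
idealized solution. Unfortunately I wasn't able to prove it, but numerical simulations
show that this correction is in practice much smaller than $\epsilon$.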

If we return to the original states, this approximation says that

$$P(x \le x') \approx \log_b(x'/l).$$

Differentiating it, we get that P(x) is approximately proportional to 1/x. We will
use this in further calculations.

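To work with 1/x sequences we can use the well known harmonic numbers:

$$\mathcal{H}(n) := \sum_{i=1}^{n} \frac{1}{i} = \gamma + \ln(n) + \frac{1}{2} n^{-1} - \frac{1}{12} n^{-2} + \frac{1}{120} n^{-4} + O(n^{-6}) \qquad (34)$$

where $\gamma = 0.5772156649...$ Using this formula we can easily find the normalization
coefficient $\mathcal{N}$:

$$\frac{1}{\mathcal{N}} = \sum_{x=l}^{bl-1} \frac{1}{x} = \mathcal{H}(bl-1) - \mathcal{H}(l-1) \approx \ln(b)$$

For the rest of the paper we will use the

$$P(x) \approx \frac{\mathcal{N}}{x} \qquad (35)$$

approximation. Now we can for example calculate the probability that while encoding
symbol s we will transfer $k_s - 1$ digits:

$$P(x < X_s) \approx \mathcal{N} \left(\mathcal{H}(X_s - 1) - \mathcal{H}(l - 1)\right) \approx \frac{1}{\ln(b)} \ln(b^{k_s} q_s) = c_s \qquad (36)$$

We can also define the expected value of some function during the coding/decoding
process:

$$\langle f(x) \rangle = \sum_{x \in I} P(x) f(x) \approx \mathcal{N} \sum_{x \in I} \frac{f(x)}{x} \qquad (37)$$

Numerical simulations show that these are usually very good approximations.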
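
The normalization coefficient and the estimate (36) are easy to evaluate directly. The following sketch assumes the P(x) = N/x approximation from (35); the function names and the example values b = 2, l = 1024, l_s = 307 (so k_s = 2 and X_s = l_s b^{k_s} = 1228) are chosen here for illustration.

```python
from math import log

def harmonic(n):
    """H(n) = 1 + 1/2 + .. + 1/n, computed by direct summation."""
    return sum(1.0 / i for i in range(1, n + 1))

def normalization(l, b):
    """N such that the sum of N/x over I = {l, .., b*l - 1} equals 1."""
    return 1.0 / (harmonic(b * l - 1) - harmonic(l - 1))

def prob_fewer_digits(l, b, Xs):
    """P(x < X_s) under P(x) = N/x, the left side of (36)."""
    return normalization(l, b) * (harmonic(Xs - 1) - harmonic(l - 1))

l, b, ls, ks = 1024, 2, 307, 2          # example parameters only
Xs = ls * b ** ks                        # boundary value from (21)
cs = log(Xs / l, b)                      # c_s = log_b(q_s b^{k_s})
print(1.0 / normalization(l, b), log(b))
print(prob_fewer_digits(l, b, Xs), cs)
```

Both pairs of printed numbers are close to each other, and the differences shrink as l grows.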
6.2 Evaluation of the compression rate 

Using the constructed coders we can get as near Shannon's entropy as we need. In
this subsection we will evaluate this distance. It is very sensitive to parameters, so
the evaluations will be very rough - only to find the general dependence on the main
parameters.

Having probability distribution of the states, we can now use ^ formula 

Impreciseness of our encoder is more or less random and we can only estimate its
expected values, so for this estimation we can treat $1/x^2$ and $(\epsilon_s(x))^2$ as independent
variables. It would also allow us to separate compression rate losses into those which come
from the $l$, $b$ parameters only and those caused by impreciseness of the coder.

\ln(4)^x2/ \n{A)^xx^ ln(4)^^V, P ln(6) ln(4) 

(39) 

For the precise initialization {J2s1si^s{x)y) intuitively shouldn't depend strongly 
on the $l$, $b$ parameters, but rather on n and the probability distribution. Pessimistically
using (23) we can bound it from above by n , but in practice it's usually smaller than n.



Let's focus on ScD initialization now. The term with fractional part of e is much 
smaller than the main source of imperfection, so we can omit it. 

~ -^^s (fe^ II (g,fefe;>-b<Xsl ~ ^) (^^ ~ q,b''s-l^<Xs] ) d.X = 

= -^Es jAyil' + - 1) HI) - hHb)) ^l{n-l)[^^+ log,(/) - ^) 
Usually the largest is the term with $\log_b(l)$, so finally

/ 62 in(4)ln(6) ^ ' 

Compared to numerical simulations these estimations are very pessimistic: we get
values many times (like 10-100) smaller, but the general $\log(l)/l$ behavior looks to be
fulfilled.



To summarize: in practice we rarely require the coder to be closer to optimal
than e.g. 1/1000, which can be achieved with $l/n$ usually below 100 for ScD
initialization. If needed, we can divide $I$ into subranges initialized separately to
improve preciseness.

6.3 Probability distribution of digits and symbols

The fact that states with smaller numbers are more probable unfortunately means that
the produced sequences aren't exactly uniform, uncorrelated sequences, which would be
expected for example if we would like to use ANS in cryptography. We will analyze
it briefly now, and the next section will show how to correct it.

First of all let us assume that we are coding some sequence of symbols to produce
a sequence of digits. Look at fig. 2. The last transferred digit says in which subrange
of states indistinguishable after bit transfer we are. So the fact that $P(x)$ is
generally decreasing makes it a bit more probable that this last transferred
digit is 0. Let's estimate this probability to see how it depends on parameters.

Using $\mathcal{D}(x) := \mathcal{D}(\log_b(x/l)) \approx \log_b(x) - \log_b(l)$, we get the probability that this
last (while coding) / first (while decoding) digit is 0:

^(X.-0/^>'=-^-l ^1^1 + + ^/c.-2 _ 1) _ ^ ^^fe.-l _ 1) ^ 

_ v^(X, -1-1 fefc^-a ^ 1 ^{X,~l)/b>'s-^-l 1 ^ 

~ Z^i=0 (Z+i6''<i-i)ln(fe) bln{b) 2^i=0 i+l/b'^s'^ 



^ inxjb''^-' - 1) - mi/b"^-' log, ( 



61n{6) V '-V^'-s/ <^ ^ ^ ' i-VV ^ "-I ) b \^ l/b^s-i-i 

' qsb^'^-b'=^-^/l \ ^ 1 , ( jh^ , flfZlfn 1 ~ I 1 



where we've used $\mathcal{D}(x + h) - \mathcal{D}(x) \approx h \mathcal{D}'(x)$ and the simplest approximation for
harmonic numbers. We could get a constant a few times smaller if we took a better
approximation of harmonic numbers and the derivative in the middle of the range. If
we are interested only in the general dependence on parameters, the presented approximation
is good enough.

In the second range probability distribution of states decreases slower but ranges 
are larger. Analogous calculation gives + ;g bHn{b) ~ Qsb''")- 

If we sum these values, we get that while encoding symbol s, probability that 
the last digit while bit transfer will be is ^ + hHn{b) ■ 

If we average obtained correction over all possible symbols, we get that proba- 
bility is larger than uniform digit distribution by approximately 

(41) 



62 \nb I 



In fact this value is a few times smaller and in practice we can use large $l$ like
10^ - 10^ to make tables fit in cache memory, so this effect can be extremely weak.
While estimating, probability uncertainty decreases with the square root of the
number of events, so even observing this effect would require analysis of gigabytes
of output. Retrieving some useful information like the probability distribution of block
lengths would require much more data. For succeeding digits and correlations this
effect will be correspondingly smaller. We will see in the next section how to
reduce it, if needed, by as many orders of magnitude as we want.
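
To make the square-root argument concrete, here is a rough sketch of how many output digits one would need to observe before a small bias in a binary digit becomes visible. The bias value `delta = 1e-5` is purely hypothetical (it is not taken from the estimate above), and the function name and the choice of three standard deviations are assumptions made for illustration.

```python
import math

def samples_to_detect_bias(delta: float, z: float = 3.0) -> int:
    """Rough number of observed binary digits needed to see a bias `delta`
    in the frequency of 0 at about `z` standard deviations.

    The standard error of an estimated frequency near 1/2 is 0.5 / sqrt(n),
    so we need 0.5 * z / sqrt(n) <= delta.
    """
    return math.ceil((0.5 * z / delta) ** 2)

delta = 1e-5  # hypothetical bias, for illustration only
n = samples_to_detect_bias(delta)
print(n, "digits, i.e. about", n / 8 / 2**30, "GiB of output")
```

Halving the bias quadruples the required amount of data, which is why a weak effect quickly becomes unobservable in practice.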

Now let us focus on the opposite situation - we have some sequence of digits and
we want to encode them into symbols of a given probability distribution. This time
states are not gathered into subranges as previously, but distributed randomly and
more or less uniformly, so the differences should be much smaller. But if we need to
evaluate their probability distribution more precisely than $l_s/l$, we can for example
use our approximation of the state probability distribution, so the probability that we
will produce symbol $s$ is approximately:



$$P(s) \approx \sum_{x \in I:\ D(x) = (s, \cdot)} \frac{\mathcal{N}}{x} \tag{42}$$

This formula also says more precisely which probability distribution of symbols is
encoded closest to Shannon's entropy. Using it we could also modify the
coding/decoding functions to better approximate the expected probability
distribution of symbols using the same $l$. Shifting some appearances of a symbol left (right)
increases (decreases) its probability a bit.



7 Practical remarks and modifications

This section contains practical remarks for implementation of the presented coders and
some additional modifications which can improve some of their properties for
cryptography and error correction purposes.

7.1 Data compression

Data compression programs are generally constructed in two ways:

• We use a constant probability distribution of symbols. It could be generally
known for a given type of data or estimated by statistical analysis of the file. In
the second case it has to be stored in the compressed file, or

• The used probability distribution is dynamically estimated while encoding the
file, so that while decoding we can restore these estimations using already
decoded symbols. This approach is a bit slower, but we don't need to store
probability distribution tables, we process the file only once and we can get
good compression rates with files in which the probability distribution of symbols
varies locally.

ANS is perfect for the first case: using a table smaller than 100kB we can get a
very precise coder which encodes about 8 bits for each use of the table. It has two
problems:

• For each probability distribution we have to make a separate initialization. We
could also store some number of such tables. Observe that while changing the
coder, if $b$ and $l$ are the same, we can just use the same state.

• Decoding and encoding are made in opposite directions - we get different
sequences for estimations. To solve this problem we should process the file twice:
first make the whole prediction process from the beginning to the end, then
encode it in backward order. Now decompression can simply run forward.

In Matt Mahoney's implementations (fpaqa, fpaqc in [5]) the data is divided
into separately compressed segments, for which we store q from the prediction
process.
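
The two-pass scheme from the second problem can be sketched as follows. Note the swapped-in technique: for brevity this is not the tabled ANS coder of this paper but a simplified arithmetic-formula variant working on a single arbitrarily large integer state (no bit transfer), and the adaptive model (plain symbol counts starting from 1) is an assumption made for illustration. All function names are hypothetical. What the sketch shows is only the ordering: predictions are made forward, encoding runs backward, and the decoder can then rebuild the same predictions while running forward.

```python
def model_snapshots(symbols, alphabet_size):
    """Forward pass: the frequency counts as seen *before* each symbol."""
    counts = [1] * alphabet_size
    snapshots = []
    for s in symbols:
        snapshots.append(list(counts))
        counts[s] += 1
    return snapshots

def encode(symbols, alphabet_size):
    """Backward pass: encode the last symbol first, each with its own snapshot."""
    snapshots = model_snapshots(symbols, alphabet_size)
    x = 1  # initial state of the (big integer) coder
    for s, counts in zip(reversed(symbols), reversed(snapshots)):
        total, start, freq = sum(counts), sum(counts[:s]), counts[s]
        x = (x // freq) * total + start + (x % freq)
    return x

def decode(x, n, alphabet_size):
    """Decoding runs forward and updates the same adaptive model on the fly."""
    counts = [1] * alphabet_size
    out = []
    for _ in range(n):
        total = sum(counts)
        r = x % total
        s, start = 0, 0
        while start + counts[s] <= r:   # find the symbol whose range holds r
            start += counts[s]
            s += 1
        x = counts[s] * (x // total) + r - start
        out.append(s)
        counts[s] += 1
    return out, x

message = [0, 1, 1, 2, 1, 0, 1, 1]
state = encode(message, alphabet_size=3)
decoded, final_state = decode(state, len(message), alphabet_size=3)
print(decoded == message, final_state)
```

The decoder ends in the encoder's initial state, which can serve as a simple consistency check.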

For ABS the situation is a bit different - we have mathematical formulas that are
relatively quick to calculate and a much smaller space of probability distributions, but we can
encode only one binary choice per step. We have generally two options:

• Calculate the formulas for every symbol while processing data - it is much more
precise and because of it we can use large $b$ to transfer a few bits at once, but
it can be a bit slower (fpaqc), or

• Store the tables for many possible q in memory - it has smaller precision and
needs memory and time for initialization, but should be faster and we have
large freedom of choosing coding/decoding functions (fpaqa).

7.2 Bit transfer and storing the tables 

For ABS using the formulas we can use large $b$, but in other cases we should rather
use $b = 2$. For ANS it means doing bit transfer many times in each step - this quick
operation may become essential for the transfer rate of the coder. Intuition suggests
that we should be able to join them into one operation per step: for example use
AND with a proper mask to get the bits and make the corresponding right bit shift of the
state.

It looks like the first problem is the order of these bits - that coding and decoding 
use them in reverse directions. But in fact in each step we know how many bits we 
should transfer and so we can just use the same direction for coding and decoding. 

The larger problem is to determine this number of digits to transfer: $k_s - [x < X_s]$.
It requires a comparison and the use of small tables in which $k_s$, $X_s$ and maybe
the mask are encoded on different bits. We could also store this information in the
coding/decoding tables.
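
The one-operation bit transfer can be sketched for $b = 2$ as below. This is a sketch under assumptions: $k_s$ is taken as the smallest $k$ with $l_s 2^k \ge l$ and $X_s = l_s 2^{k_s}$, which is how the threshold is read here; the function names are hypothetical. The loop version is included only to compare against the mask-and-shift version.

```python
def transfer_params(l: int, l_s: int):
    """Return (k_s, X_s) for b = 2, assuming 1 <= l_s <= l.

    k_s is the largest number of bits ever transferred for this symbol,
    X_s = l_s * 2**k_s is the threshold below which one bit fewer is sent.
    """
    k_s = 0
    while (l_s << k_s) < l:
        k_s += 1
    return k_s, l_s << k_s

def transfer_one_shot(x: int, k_s: int, X_s: int):
    """Bit transfer as a single mask and shift: returns (bits, count, new x)."""
    k = k_s - (x < X_s)            # in Python, True counts as 1
    mask = (1 << k) - 1
    return x & mask, k, x >> k

def transfer_loop(x: int, l_s: int):
    """Reference version: move one bit at a time while x is not in I_s."""
    bits, k = 0, 0
    while x >= 2 * l_s:
        bits |= (x & 1) << k
        x >>= 1
        k += 1
    return bits, k, x

l, l_s = 16, 3                     # hypothetical sizes
k_s, X_s = transfer_params(l, l_s)
for x in range(l, 2 * l):
    assert transfer_one_shot(x, k_s, X_s) == transfer_loop(x, l_s)
print("k_s =", k_s, "X_s =", X_s)
```

Both versions produce the same block of bits in the same order, so coding and decoding can indeed treat the block as one unit.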

Let's think how to store the tables to find a compromise between memory needs 
and speed. 

Coding tables require for a given symbol $(b - 1) l_s$ values from $(b - 1) l$
possibilities. Usually $l_s$ isn't constant, so to optimize it for memory requirement we can
encode it in one table of length $(b - 1) l$: store C(s,x) as C[beginning[s]+x]+l where
beginning[s] := $(b - 1) \sum_{s' < s} l_{s'} - l_s$. On the other side of the memory/speed
compromise is storing the whole C. On some bits of the values of this table we can store the
number of transferred digits or even their sequence.
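
The single-table layout can be illustrated with a short sketch. The appearance counts `counts` (the $l_s$ values) and the helper names `beginnings`, `store` and `lookup` are hypothetical; the point is only that the offsets tile the table of length $(b - 1) l$ without gaps or overlaps, and that storing the new state minus $l$ keeps the entries small.

```python
def beginnings(counts, b):
    """beginning[s] = (b - 1) * sum(l_s' for s' < s) - l_s."""
    out, offset = [], 0
    for l_s in counts:
        out.append(offset - l_s)
        offset += (b - 1) * l_s
    return out

counts = [3, 5, 8]          # hypothetical l_s for three symbols
b = 2
l = sum(counts)
begin = beginnings(counts, b)

# Every (s, x) with x in I_s = {l_s, ..., b*l_s - 1} gets its own slot.
slots = [begin[s] + x
         for s, l_s in enumerate(counts)
         for x in range(l_s, b * l_s)]
print(slots == list(range((b - 1) * l)))

flat = [0] * ((b - 1) * l)  # the single coding table

def store(s, x, new_state):
    flat[begin[s] + x] = new_state - l      # keep values in 0 .. (b-1)*l - 1

def lookup(s, x):
    return flat[begin[s] + x] + l           # C(s, x)
```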

The situation with decoding tables is simpler: we can use a single table of length
$(b - 1) l$ and store $s$ and the number of the new state on its different bits. We could
also encode there the number of digits to transfer or even their sequence.

All these ideas require additional memory or time for using small tables. The
best would be to generate, during initialization, separate low level code for
each symbol - with specific $X_s$, transfers and bit masks. It can be stored such
that choosing the behavior for $s$ is just a jump by some multiple of $s$ positions.

7.3 The initial state 

Stream coding/decoding requires choosing the initial state. The final state of one
process has to be stored in the file to be able to reverse it. As was previously
mentioned - while changing coding tables, if $l$, $b$ remain the same, we don't have to
change the state.

The initial state can be freely chosen - as a fixed number or randomly. We don't 
have to store intermediate states when we change the coding tables, but we have to 
store the final state. This state will be initial while decoding. 

The problem could be that we are wasting a few bits in this way. Usually it 
should be insignificant, but for example when we want to encode separately a huge 
number of small files, such bits could be essential. 

We can improve it by encoding some information in this initial state of the coder. 
We can do it for example by using a few steps of coding without bit transfer, starting 
from X = state. We can always do it using binary choices (ABS). Eventually we 
could use ANS, but it would require creating tables for additional ranges. 

7.4 Removing correlations 

In the previous section we have seen that the probability distribution of produced
bits (digits) isn't perfectly uniform. It's a very small effect and for correlations it
would be even much smaller, but it could be significant if we would like for example
to use it as a pseudorandom number generator. We could use some additional layer
of encryption to remove correlations, but we can also do it in a simpler and faster way.

The first idea to equalize the probability distribution of digits is to negate (NOT)
the transferred digits for every second processed symbol - e.g. in steps of even number.
In this way 0 and 1 become equally probable, but there would
remain some correlations - '00', '11' would be a bit more probable than '01', '10'.
If in one block of transferred bits we have '0', it's a bit more probable that a
few bits further (in the next block) we will have '1'.

This idea can be thought of as making XOR with '00000...' and '11111...' cyclically.
We can improve it by using some longer, randomly looking sequence of numbers in the
$\{0, \ldots, \max_s b^{k_s}\}$ range. They can be generated using some pseudorandom number
generator or even chosen somehow optimally and fixed in the coder as its internal
parameters. We have to be able to recreate this sequence for decoding and store the
number of the last position in the file.

Now in each step of coding we take the next number from this cyclical
list and, before transferring the block of digits, XOR the block with this number. While
decoding we have to use the same list and make the XOR before using the obtained bits.
In this way we can reduce correlations by as many orders of magnitude as we need.
Block lengths vary practically randomly, so knowing this list wouldn't allow one to
remove this transformation.
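
A minimal sketch of this masking step, for $b = 2$, is given below. It operates on already formed blocks, given as hypothetical `(value, nbits)` pairs, and uses Python's `random.Random` only as a stand-in generator for the mask list (it is not a cryptographically secure generator). The function names are assumptions. The decoder walks the blocks backward, starting from the stored last position.

```python
import random

def make_masks(seed: int, count: int, max_bits: int):
    """Cyclical list of masks; random.Random is only a stand-in generator."""
    rng = random.Random(seed)
    return [rng.getrandbits(max_bits) for _ in range(count)]

def mask_blocks(blocks, masks, start=0):
    """XOR each (value, nbits) block with the low nbits of the next mask."""
    out = []
    for j, (value, nbits) in enumerate(blocks):
        m = masks[(start + j) % len(masks)] & ((1 << nbits) - 1)
        out.append((value ^ m, nbits))
    return out

def unmask_blocks_backward(masked, masks, last_pos):
    """Undo the XOR while walking the blocks from the last one to the first."""
    out, pos = [], last_pos
    for value, nbits in reversed(masked):
        m = masks[pos % len(masks)] & ((1 << nbits) - 1)
        out.append((value ^ m, nbits))
        pos -= 1
    out.reverse()
    return out

blocks = [(0b101, 3), (0, 0), (1, 1), (0b11, 2)]   # hypothetical transferred blocks
masks = make_masks(seed=42, count=5, max_bits=8)
masked = mask_blocks(blocks, masks)
print(unmask_blocks_backward(masked, masks, last_pos=len(blocks) - 1) == blocks)
```

In a real coder the block length is known from the state before the bits are used, which is why the sketch can take `nbits` as given.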

7.5 Artificial increasing the number of states 

Usually the number of states is $(b - 1) l$, but we will see in the next chapter that
sometimes it's not enough. There are generally two ways to artificially increase it
exponentially:

• Intermediate step(s) - the basis of security of an ANS based cryptosystem is that
the length of blocks and the state vary practically randomly. These effects are
much weaker if we want for example to encrypt, without compression, standard
data - bytes with uniform probability distribution. To cope with this problem
we can for example introduce an intermediate step with an even randomly chosen
probability distribution of symbols.

Stream coder/decoder in one step changes a block of bits into a symbol or
the opposite. We can combine such steps: the decoder changes a block of bits into a
symbol of a given probability distribution and the encoder immediately changes it
into a new block of bits. Encoder and decoder have their own, completely separate
states and modify them in opposite directions: $(y, y') \to (\{y - c_s\}, \{y' + c_s\})$.
It looks like we move on a straight line on this two dimensional torus,
but because of impreciseness, this line diffuses in the second direction and
asymptotically should cover this 'torus' uniformly - the total number of states
is practically the square of the original one. Surprisingly, because they use
separate states, encoder and decoder can be reverses of each other.

This approach is slower, but can be useful for cryptographic applications. 

• Additional sequence of bits - while using ANS as an error correction method, the
internal state of the coder contains something like a hash value of the already
processed message. So if it has a small number of possibilities, we can accidentally get
the correct value with a wrong correction. The search for the proper correction
requires a lot of steps, so they should be as fast as possible.

In the previous point in each step we've changed the whole internal state of
the coder - each use of a table changes one part of it, so it is relatively slow. To
make it faster, we should use the table only once per step - change only part
of the internal state of the coder. We could do it sequentially, but it would
just separate the data into subsequences processed separately.

An example of a practical way is to expand the state of the stream coder by
some cyclical table (t) of bits (or possibly short bit sequences). Now coding is:
make the bit transfer, then switch the least significant bit of this reduced state with the bit
at the current position in this table, increase this position cyclically and finally use
the coding table. The decoding step: use the decoding table, decrease the position,
switch the least significant bit and make the bit transfer.



Stream coding(s):
{while (x ∉ I_s)
   {put mod(x,b) to output; x = ⌊x/b⌋}
 switch (x AND 1) ↔ t[i]; i++;
 x = C(s,x)
}

Stream decoding:
{(s,x) = D(x);
 use s;
 i--; switch (x AND 1) ↔ t[i];
 while (x ∉ I)
   x = xb + 'digit from input'
}

To ensure that this bit switch doesn't take us out of $I_s$, we have to enforce that all $l_s$
are even. These switches slightly increase the impreciseness of the coder, but if we
switch only one bit, it is practically insignificant.

This table has to be stored somehow in the output file. It's cyclical, so instead
of storing the position, we can rotate it so that decoding starts
with the first position.

We see that we can in a fast and simple way increase the number of internal
states as much as we want. To make it faster we can represent this table of bits
as one or a few large numbers.
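
The stream coder with the cyclical bit table can be sketched as runnable code for $b = 2$. This is an illustration under assumptions, not the initialization described earlier in the paper: the symbol spread in `build_tables` is a simple hypothetical one (appearances sorted by their ideal fractional position), the tables are Python dictionaries, and all names are introduced here. The counts `l_s` are chosen even, as required above.

```python
def build_tables(counts):
    """counts[s] = l_s, the number of appearances of symbol s (all even).
    Returns l and coding/decoding tables for states in I = {l, ..., 2l - 1}."""
    l = sum(counts)
    # Hypothetical spread: sort appearances by their ideal fractional position.
    spread = sorted(((j + 0.5) / l_s, s)
                    for s, l_s in enumerate(counts) for j in range(l_s))
    coding, decoding = {}, {}
    next_x_s = list(counts)            # x_s runs over I_s = {l_s, ..., 2*l_s - 1}
    for offset, (_, s) in enumerate(spread):
        x = l + offset
        decoding[x] = (s, next_x_s[s])
        coding[(s, next_x_s[s])] = x
        next_x_s[s] += 1
    return l, coding, decoding

def swap_low_bit(x, t, i):
    """Exchange the lowest bit of x with t[i] (t is modified in place)."""
    low = x & 1
    x = (x & ~1) | t[i]
    t[i] = low
    return x

def encode(symbols, counts, coding, x, t):
    bits, i = [], 0
    for s in symbols:
        while x >= 2 * counts[s]:      # x not in I_s: transfer one bit
            bits.append(x & 1)
            x >>= 1
        x = swap_low_bit(x, t, i)
        i = (i + 1) % len(t)
        x = coding[(s, x)]
    return bits, x, i

def decode(n, l, decoding, bits, x, t, i):
    out = []
    for _ in range(n):
        s, x = decoding[x]
        out.append(s)
        i = (i - 1) % len(t)
        x = swap_low_bit(x, t, i)
        while x < l:                   # x not in I: read bits back (last written first)
            x = 2 * x + bits.pop()
    out.reverse()                      # symbols come out last-to-first
    return out, x

counts = [2, 4, 10]                    # hypothetical even l_s, so l = 16
l, coding, decoding = build_tables(counts)
message = [2, 2, 1, 0, 2, 1, 2, 2, 2, 1]
t0 = [1, 0, 1, 1, 0]                   # hypothetical cyclical bit table
t = list(t0)
bits, x_final, i_final = encode(message, counts, coding, l, t)
decoded, x_start = decode(len(message), l, decoding, list(bits), x_final, t, i_final)
print(decoded == message, x_start == l, t == t0)
```

After decoding, both the state and the bit table return to their initial values, which is the property that makes a fixed initial state usable as a final verification.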

The problem is that this large state will be required to start decoding, so we have
to store it in the file. If it is used for error correction, it has to be well protected.
For this purpose the initial state of the coder should be some constant of the coder,
which allows us to make the final verification. We could also, if needed, encode some
information in this initial state of the coder as previously.



8 Cryptographic applications 

Asymmetric numeral systems were created for data compression purposes, but this
simple and seemingly new idea of coding has some properties which make it very
promising also for cryptography and error correction purposes. It can even fulfill
all these purposes simultaneously.

8.1 Pseudorandom number generator and hashing function 

We have seen that we can think about the state of the coder as some hidden random
variable, which chooses the current behavior - state change and produced bits. As we
would expect from an entropy coder - the output bit sequence is nearly uniform
and practically uncorrelated. Unfortunately it's not perfect, but we can use not
the whole state but only some of its least significant bits, which would reduce correlations
greatly. Additionally we could for example use some set of masks as in the previous
section.

Pseudorandom number generators (PRNG) are initialized by a so called seed state:
they generate randomly looking sequences, but if we use the same seed, the obtained
sequence is also the same. To use a PRNG in cryptography, it has to meet
additional requirements: having a sequence generated by it, we cannot get any
information about the seed or further/previous bits. In the next subsection we will
see that with properly chosen parameters, we shouldn't even be able to reveal the
sequence of symbols used to generate the random bit sequence.

So to use ANS as a pseudorandom number generator we have to choose some
coding function, for example initialized using the seed state. Now we have to feed it
with some sequence of symbols. If this sequence is periodic, after some multiple
of this period the state of the coder would be the same - the bit sequence would
also be periodic. But this period is much longer than the period of the symbol sequence:
longer by a factor of about the number of internal states of the coder. In the previous section
we have seen that this number can be easily increased as much as we want, so in
practice the sequence of symbols can be taken from some very weak pseudorandom
generator, or even taken as some fixed periodic sequence.

Hashing functions change files into some short randomly looking sequence of
given length. We shouldn't be able to get any information about the file from it.
Additionally we shouldn't even be able to find in practice two files which
give the same value. To fulfill these requirements, we can for example increase the
number of states of the coder by using an additional table of bits as previously, process
the message and for example return this table as the hash value.

If we didn't increase the number of states, someone could find two prefixes
giving the same state and switch them. We could also prevent finding two messages
with the same hash value by encoding the message twice - forward and backward. For
example we can decode the file into a sequence of symbols of some fixed/generated
probability distribution, then change the state and encode it back into a sequence of
digits. Without changing the state we would just get the same file, but any change
makes us produce a practically random sequence - we can now for
example combine some least significant bits of the last used states to get the hash function.

For this purpose extremely small correlations should be completely insignificant.
If needed, we could easily reduce them.

8.2 Initialization for cryptosystem 

For given parameters we still have a huge number of coding functions with practically
the same statistical properties, but producing completely different encoded
sequences. If we make the selfcorrecting diffusion initialization using some PRNG
initialized using a given key, we get a practically unique coding function for this
key. If we use it to encode some information, it looks practically impossible
to decode the message not knowing the key. We will now take a closer look at
such an approach to data encryption.

First of all, let us focus on the ScD initialization. It consists of a large number of picks of
a random symbol from some large table. The coding table is approximately given
by the symbol probability distribution, but it looks practically impossible to find its
precise values not knowing the key. The initialization process strongly depends on
its history, which creates a specific symbol distribution in the symbols table - even
knowing the key, it looks practically impossible to find $C(s, x)$ without making the whole
previous initialization (for smaller $x$).

So to start decoding we practically have to make the whole initialization. Observe
that we can force the PRNG to require as much time to be calculated as we want, for
example:

for i=1 to N do {k=random; read k random values}

means that we know only statistically, approximately, at which position of the PRNG we will
be. To find the exact position, we just have to make all the calculations.
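
The loop above can be sketched as follows. Python's `random.Random` is used here only as a stand-in generator (it is not cryptographically secure), and the function name `delayed_rng` as well as the `max_skip` parameter are assumptions made for illustration.

```python
import random

def delayed_rng(key: int, rounds: int, max_skip: int = 256) -> random.Random:
    """Seed a generator with the key and advance it by a data-dependent amount.

    The final position depends on every intermediate value, so reaching it
    requires performing all the steps.
    """
    rng = random.Random(key)           # stand-in PRNG, for illustration only
    for _ in range(rounds):
        k = rng.randrange(max_skip)    # k = random
        for _ in range(k):             # read k random values
            rng.random()
    return rng

a = delayed_rng(key=1234, rounds=1000)
b = delayed_rng(key=1234, rounds=1000)
print(a.getstate() == b.getstate())    # the same key leads to the same position
```

The cost of checking a key grows linearly with `rounds`, while the cost of processing data afterwards is unchanged.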

We see that this way we can enforce some time required for initialization.
Connecting it with the unpredictability of ScD initialization, we see that such a
cryptosystem would be extremely resistant to brute force attacks. The standard approach
makes all computations while processing the file, so to check if a given key is correct
we can just start decrypting the file and observe if the output for example isn't a
completely random sequence. In the presented approach, most of the computation is
made during initialization: to check if a given key is correct we have to spend a given time
to make the initialization, for example enforced to take about 0.1s - it's a few orders
of magnitude longer than in the standard approach. After initialization the processing
of the data uses already calculated tables - it is much faster than in the standard approach.

Now assume that someone got the coding function - does it mean that
he can retrieve the key? This function says which symbols were chosen in each step, but each
symbol could be chosen in many ways, so in fact he wouldn't have the sequence of used
random variables, but only some sets of its possible values - even using a weak
PRNG it looks practically impossible to deduce the key. If needed, we could use
some secure PRNG, for which it is ensured that knowing the exact sequence, we
couldn't find its seed state and so the key.

This property suggests extremely powerful additional protection - use not only 
the key as the seed state, but also some number which can be even stored in the 
file. Now after every encrypted fixed number of bits (like gigabyte), we change this 
number, store it and use to generate new coding tables. The size of these blocks 
should be chosen so that it wouldn't be possible to retrieve any essential information 
from them. The behavior of each one is practically unrelated, so their information 
couldn't be connected for finding the key. 

Sometimes we would like to make encryption and entropy coding at the same
time. The question is - what to do with the probability distribution of symbols.
There wouldn't be a problem if we used some adaptive prediction method,
but it would also require using many different coding tables. We will see that these
tables should be rather large, so sometimes it might be better to use a fixed probability
distribution of symbols. It has to be stored in the header and so is easily accessible.
We will see that such knowledge shouldn't make breaking the code easier,
but it gives some knowledge of the file content, which can be unwanted. To prevent it,
this header can be encrypted separately using the same key but probably in some
different way.

8.3 Processing the data 

The coder uses the state, which is a hidden, practically random variable. The also hidden,
randomly generated local behavior of the coding function defines the current behavior -
how many digits to produce and to which state to jump. Blocks created this way are
relatively short, but they have various, practically randomly chosen lengths. This
picture looks perfect, but unfortunately there are some weaknesses which could give
some information about the statistics of symbols or even the coding function. They would
vanish if we used some additional layer of standard encryption, but I will try
to convince the reader that using only ANS with proper parameters and some quick and simple
modifications, we can make really safe and fast encryption.

• First of all, as was mentioned in the previous section - the basis of the
randomness of the state is that we don't use a uniform distribution of
symbols (asymmetry) and that some symbols have probability not being an
integer power of $b$. These assumptions are in practice automatically fulfilled when
we make encryption and entropy coding at the same time, but sometimes we
would like just to encrypt some more or less uniform byte sequence. The best
way to cope with this problem is to use the intermediate step from the previous
section - using the same PRNG choose some probability distribution of symbols
and then in one step encode a byte into a symbol and immediately use it to
produce the output bit sequence. Alternatively, if we want to make it quicker - use
only one step: we can use the same PRNG to randomly modify the uniform
distribution among bytes a bit and treat the input sequence as a sequence of such
symbols. The cost is that the state doesn't change as fast as previously and
that the output file is a bit larger than the input. We also have a smaller number
of possible states of the coder this way. Another option is to use so called
homophonic substitution - to each symbol assign a few new ones and choose
among them using some separate (hardware) random number generator, but
it would increase the size of the message.

• If we would encode the same sequence starting from the same state of the coder, 
we will get the same output. To prevent attacks based on such situations, 
we should increase the amount of its internal states. In the previous 
section were shown some ways to do it - use some correlation removing method, 
intermediate step or additional bit sequence. 

• As was previously mentioned, because the probability of being in a given
state ($x$) is not uniform, but is decreasing ($\propto 1/x$), some produced blocks
of digits are a bit more probable (those with smaller digits). These differences are
extremely small and because of the various block lengths I don't see a way to
use it to find a given block structure or some precise information about the coding
function. But analyzing statistically a huge amount of data, one could evaluate the
probability distribution of block lengths, which gives some information about the
probability distribution of symbols. To prevent it we can use some of the presented
methods of removing correlations. We could also sometimes generate a new
coding function, for example with the same key but with some new additional,
openly stored number.

• Transferred digits are the least significant digits of the state. If someone had
both the ciphertext and the corresponding plaintext, made a correct assumption
about the internal state and blocking at a given moment and knew precisely
the used probability distribution of symbols, he could track the history of the
processing, which would reveal the used coding table. Let's focus on such a scenario.
Knowing the probability distribution of symbols, we know that a coding step changes the
state approximately as $x \to x / (q_s b^{k_s - [x < X_s]})$.
If we used ScD initialization, the impreciseness of such a prediction of $x$ is of
$\sqrt{l}$ order. The transferred digits give the precise position in a range of width
$b^{k_s - [x < X_s]}$ ($1/q_s$ on average). So if $l$ is large enough,

$$l > q_s^{-2} \tag{43}$$

in the presented scenario the number of possibilities the person would have to
consider would grow exponentially, making such an attack completely impractical.
Observe that $q_s$ is on average $1/n$, so the above condition also says that $l > n^2$.

In practice any presented method for increasing the number of internal states 
should also prevent such scenarios. 
• Having a lot of ciphertext and corresponding plaintext, one could try to make
some statistical analysis to connect symbols with blocks. Because of the various
lengths of blocks it doesn't look practical, but to prevent such eventualities it
would be desirable that every symbol could produce practically all possible
least significant digits of the state. A given symbol can produce $(b - 1) l_s \approx (b - 1) l q_s$
different states and what matters (what is shown in the encrypted file) are, say, the $k_s$
least significant digits ($1/q_s$ values on average), so we again get the $\sqrt{l} > 1/q_s$ condition.
Again, any other modification would also give good protection.

• If one can use the initialized coder (adaptive scenario) and has some message
encrypted with it, he could try to use different inputs, slowly exposing succeeding
bits of the plaintext. This is an unavoidable weakness of cryptosystems using short
block lengths. Fortunately there is a simple universal protection against such
rare scenarios: add a few random bytes at the beginning of the file before
starting encryption or choose the initial state randomly (this time not using the
PRNG used for initialization). In this way while encrypting the same data,
we will most probably get different output, which still can be decrypted into
the same input data.

I cannot assure that this list is complete, but for the moment I cannot think of more
weaknesses which could be used to break ANS based encryption. We can easily
protect against all of them.

To summarize, while designing a cryptosystem based on ANS, we should:

• Ensure the asymmetry - that the probability distribution of symbols is not
uniform,

• Use $b = 2$, for which the state probability distribution is nearest to uniform,

• Use large $l > n^2$ or even $l > q_s^{-2}$. So to make the coder faster (larger $n$), we
should use correspondingly large tables,

• Use some correlation removing modification and possibly additionally increase
the number of internal states of the coder,

• Possibly choose the initial state of the coder randomly.



9 Near Shannon's limit error correction method 

While compressing a file we remove some redundancy caused by statistical
properties. Using forward error correction methods, we are adding some easily
recognizable redundancy, which can be used to correct some errors. In the standard approach
we usually divide the message into short independent blocks. The problem is that
it's vulnerable to pessimistic cases (large local error concentrations) - if the number
of errors exceeds some boundary, we lose the whole block. This section will
show how to connect their redundancy to treat the whole message as one block. I
will focus on using ANS for this purpose, but the presented approach is more general -
fig. 3 shows how to use it for any block code and a hashing function or even only a
hashing function.

For simplicity let us assume the simplest channel for this paper: memoryless,
symmetric. That means that there is some fixed probability ($p_b \in [0, 0.5)$) that
a transmitted bit will be changed (0 ↔ 1). So while transmitting N bits, about $N p_b$
of them will be damaged.

For a channel with given statistics of errors (noise), we can speak of Shannon's
limit - the theoretical maximal information transfer rate. Constructions used to
show that this limit is achievable are completely impractical. Low-Density
Parity-Check Codes (LDPC) ([8], [9]) are near this limit, but correcting with them
still requires solving an NP problem. So in practice codes are used which divide
the message into independent blocks, which makes them vulnerable to pessimistic
cases. For example for $p_b = 0.01$, we should be able to construct a method which
adds asymptotically a bit more than 0.088 bits of redundancy per bit of message
and is able to fully repair the message. Compare it with the commonly used (7,4)
Hamming code - it adds 3 bits of redundancy per 4 bits of message to be able to
correct 1 damaged bit per such 7 bit block. It uses much more redundancy: 0.75
bits per bit of message, but because sometimes we have more than one error in a
block, we lose about 16 bits per transmitted kilobyte and we don't even know about it.
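
The 0.088 figure can be checked with a few lines of Python. This is only a sketch; the function names are chosen here for illustration and it assumes that "redundancy" is counted per bit of the original message, i.e. $h(p_b)/(1-h(p_b))$.

```python
from math import log2

def binary_entropy(p):
    """h(p) = -p lg(p) - (1-p) lg(1-p), with h(0) = h(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def min_redundancy_per_message_bit(p_b):
    """Redundant bits needed per message bit at Shannon's limit."""
    h = binary_entropy(p_b)
    return h / (1.0 - h)

if __name__ == "__main__":
    print(min_redundancy_per_message_bit(0.01))  # a bit below 0.088
    print(3 / 4)                                 # (7,4) Hamming code: 0.75
```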

Imagine we have some channel with a known statistical model of error distribution.
To transmit some undamaged message through it, we have to add some redundancy
'above' the given error density. We know only the statistics of errors, not when
exactly they will appear - so this density of redundancy should be chosen practically
constant. But the density of errors fluctuates - sometimes it is locally high,
sometimes low. We see that while dividing the data into independent blocks, we have
to choose the density of redundancy according to some pessimistic error density in
such a block. But in fact there usually isn't any pessimistic level - we only know
that the worse the case, the more rarely it occurs. So in this way most blocks use
much more redundancy than required, but for some of them this amount is still not
sufficient.

We see that to obtain a really good correction method, we should treat the
message as a whole. In LDPC this is done by uniformly distributing some large
number of parity checks. The presented approach divides the message into blocks,
but their redundancy is connected by the internal state of the coder, which contains
something like a checksum of the already processed message and is used to choose
the local behavior. Using these redundancy connections we can intuitively 'transfer'
surpluses of unused redundancy to cope with pessimistic cases. We will see that we
are able to get near Shannon's limit this way with practically linear expected time
of correction.

9.0.1 Advantage of connecting redundancy of blocks

Two approaches to connecting the redundancy of blocks will now be shown, as
in fig. 3. In fact we will do it later in a more flexible and usually faster way, but
these approaches show the advantages of this new correction mechanism. The analysis
and methodology from further subsections apply also to these approaches.

Figure 3: Two simple schemes of block codes with connected redundancy. On the
top there is an example of usage of the standard triple modular redundancy block
code - we send three copies of each bit and decode as the value with more
appearances. In the middle it is shown how to modify it to connect the redundancy
of blocks - we use a hash value of the already processed encoded message, stored in
the state of the decoder - and modify the original block by XORing it with some 3
bits of this state. On the bottom there is a simpler approach, which doesn't need a
standard block code - in one block we place some 2 bits of the state of the decoder
(a kind of checksum) and the bit to encode in the third one.

The basic tool to connect the redundancy of blocks is some hash function, which
allows us to deterministically assign some shorter, practically random bit sequence
to the already processed encoded message. In practice it is usually done by some
automaton which has a state containing such a hash value up to the given position
and changes this state while processing succeeding portions of the message.

Look at the middle of the figure - after using the original block code (like 1 →
111), we XOR it with some bits of this state, containing practically random
bits determined by the already processed message (like XOR(111, 110) = 001). Now
while decoding - if the given block wasn't damaged and the state is correct, the XOR
of the state and the block should be a codeword (000 or 111). If not - we know that
there was a damage. Now as long as in each block there is at most one damaged
bit, we immediately know which bit we should repair - the behavior is similar to
that for independent blocks. The advantage starts when more errors occur in one
block - the state from this moment will most probably be different than expected
and so for each block with some probability ($p_d = 1 - 2/2^3 = 3/4$) we should
observe that it's damaged - this is much more often than expected for the proper
correction and so it suggests where to search for additional corrections.
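
The middle scheme of the figure can be sketched as follows. This is a toy illustration under stated assumptions, not the construction used later in the paper: the state update `next_state` (a truncated BLAKE2b hash) and the choice of the three lowest state bits as the mask are hypothetical stand-ins for "some hash function" and "some 3 bits of this state".

```python
import hashlib

def next_state(state, bit):
    """Hypothetical state update: hash the old 64-bit state with the decoded bit."""
    data = state.to_bytes(8, "big") + bytes([bit])
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def mask(state):
    """Three practically random bits taken from the state (an arbitrary choice)."""
    return [(state >> 2) & 1, (state >> 1) & 1, state & 1]

def encode(bits, state=0):
    """Triple each bit, then XOR the block with 3 bits of the running state."""
    blocks = []
    for b in bits:
        blocks.append([b ^ m for m in mask(state)])
        state = next_state(state, b)
    return blocks

def decode(blocks, state=0):
    """Majority-decode each unmasked block; also report blocks that are not codewords."""
    bits, suspicious = [], []
    for k, block in enumerate(blocks):
        plain = [x ^ m for x, m in zip(block, mask(state))]
        if plain not in ([0, 0, 0], [1, 1, 1]):
            suspicious.append(k)          # damage observed in this block
        b = 1 if sum(plain) >= 2 else 0
        bits.append(b)
        state = next_state(state, b)      # a wrong bit derails all later masks
    return bits, suspicious
```

With a single damaged bit in a block the majority vote repairs it and only that block is reported. If a whole block is inverted, the block itself looks like a valid codeword, but the decoder state goes wrong and later blocks are reported with probability about 3/4 each, which is the signal that an earlier correction has to be revisited.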

The bottom example shows that we don't really need to rely on a block code -
in some positions of the block we place some bits of the state (checksum) and in the
rest of them the bit(s) we want to encode. Now we immediately observe if positions
with the checksum were damaged. If the essential bits were damaged, we will
conclude it later thanks to this new correction mechanism ($p_d$ is still 3/4). We
will see that, as this suggests, using only this new correction mechanism we can
already get near Shannon's limit. The last subsection will show how to connect it
with the block code mechanism as in the middle picture to correct simple damages
immediately, additionally speeding up the correction process.

9.1 Very short introduction to error correction 

Forward error correction can be pictured as choosing, among all sequences, some
allowed ones - so called codewords. They have to be 'far' enough from each other, so
that when we receive a damaged sequence, we are able to uniquely determine
the 'nearest' allowed one. Of course we would also want the probability that
it's really the correct sequence to be large enough. In other words we divide the
space of all sequences into separate subsets - a kind of balls around codewords. The
'thicker' these balls are, the larger the probability that we make the correction
properly.

In the standard approach we divide the message into blocks of fixed length, which
are encoded independently. In this case we can use the Hamming distance - the
number of positions on which two given sequences of bits differ. For example the
triple modular redundancy code uses a 3 bit sequence to encode 1 bit - in the space
of $2^3$ possible 3 bit sequences, 2 codewords are chosen (000, 111), which are
centers of balls of Hamming radius 1. So if while transmitting a given 3 bit block
at most one bit was changed, we can correct it properly. If the number of changed
bits is larger than one, we get into a different ball - it is corrected in the wrong
way and we don't even know it.

Let us focus on a memoryless symmetric channel: if we received '1', with
probability $1 - p_b$ it was really '1' and with probability $p_b$ it had to be '0'.
If we knew which of these cases we are in, we would get exactly one bit of
information. To distinguish them, $h(p_b) = -p_b \lg(p_b) - (1 - p_b) \lg(1 - p_b)$
bits of information are needed, so such an 'uncertain bit' contains $1 - h(p_b)$ bits
of information - to transmit N real bits, we have to transmit at least

$$ \frac{N}{1 - h(p_b)} $$ (44)

such 'uncertain bits' - it's so called Shannon's limit and the channel coding theorem
says that theoretically we can get as near as we want to this capacity. It means
that for a channel with given statistics of errors, we should be able to construct an
error correction method which uses a bit more than
$\frac{1}{1 - h(p_b)} - 1 = \frac{h(p_b)}{1 - h(p_b)}$ bits of
redundancy per bit of message and is able to completely repair the message.



While working on such potentially infinite blocks, the number of damaged bits
tends to infinity, so we can no longer work with the Hamming distance. Now while
transmitting a given codeword T of length N bits, a message R will be received with
damaged bits on more or less $N p_b$ positions. The positions of these errors can
be stored as a length N bit sequence E, such that

$$ R = T \oplus E $$ (45)

where $\oplus$ denotes addition modulo 2 of two bit vectors (XOR). This E vector
statistically should be chosen as one of $\binom{N}{N p_b} \approx 2^{N h(p_b)}$
possibilities. From (45) we see that it's also the number of possible received
sequences corresponding to one codeword. If we divide the number of all possible
received sequences by this number, we get Shannon's limit again:
$2^N / 2^{N h(p_b)} = 2^{N(1 - h(p_b))}$.
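
The approximation $\binom{N}{N p_b} \approx 2^{N h(p_b)}$ is easy to check numerically. The sketch below (function names are illustrative) compares the exponent per transmitted bit with $h(p_b)$; the two agree up to a correction of order $\lg(N)/N$.

```python
from math import comb, log2

def binary_entropy(p):
    """h(p) in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def error_pattern_exponent(n, p_b):
    """lg of the number of error vectors with round(n * p_b) ones, divided by n."""
    k = round(n * p_b)
    return log2(comb(n, k)) / n

if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        print(n, error_pattern_exponent(n, 0.01), binary_entropy(0.01))
```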

In fact the number of damaged bits is close, but rather not exactly equal, to $N p_b$.
But if we have some large number (N) of independent identically distributed random
variables of entropy H, their outcome is almost certain to be in some set of size
$2^{NH}$, all members of which have probability 'close to' $2^{-NH}$ - it's so called
'asymptotic equipartition' property ([8]). This set is called the typical set, for
example:

$$ \left\{ x \in \{0,1\}^N : \left| \frac{1}{N} \#\{i : x_i = 1\} - p \right| < \beta \right\} $$ (46)

For all $\beta > 0$ and correspondingly large N, such a set contains almost the whole
probability. A subrange of a typical sequence is asymptotically also typical, so these
practically $Np$ copies of '1' should be spread more or less uniformly.



Shannon's coding theorem says that we can get as close to the theoretical limit as
we want and we should be able to correct practically all possible typical errors. So
we should look for the proper correction among typical ones with $p_b$ probability
of '1'. The standard proof generates the set of codewords randomly, modifies this set
(removes some codewords) and shows that for large N, with probability
asymptotically going to 1, we can properly determine the transmitted codeword.
Unfortunately it would require checking an exponentially large set of possible
corrections - it rather cannot be done in practice.

9.2 Path tracing approach

First of all, let us focus on a sketch of a different but still impractical proof:
using a hashing function. Such a function allows us to assign to each message some
shorter, practically random bit sequence. Assume now that we transmit the original
message of length N bits through the channel and its hash value, 'a bit longer' than
$N h(p_b)$ bits, through some different noiseless channel. Now the receiver can check
'all typical corrections' ($2^{N h(p_b)}$) of the received message and almost certainly
only one (the proper one) will give the expected hash value. If we would like to
transfer the hash value through the same noisy channel, we can analogously send
additionally its hash value and so on (until it's smaller than some limit value which
can be encoded in some different way). So finally we would asymptotically need at
least

$$ N \left(1 + h(p_b) + h^2(p_b) + \ldots\right) = \frac{N}{1 - h(p_b)} \text{ bits} $$

Observe that 'a bit longer' can mean that the number of hash values is larger only
polynomially in N than the number of typical corrections - there still almost
certainly will be only one proper typical correction and we will get asymptotically
exactly Shannon's capacity.
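
The idea of "check all corrections against a hash value sent over a noiseless channel" can be illustrated on a tiny message. The sketch below is an assumption-laden toy: it uses a truncated SHA-256 as the hashing function and simply enumerates corrections in order of increasing number of flipped bits, instead of restricting itself to typical ones. Its cost grows exponentially with the message length, which is exactly why this proof is impractical.

```python
import hashlib
from itertools import combinations

def digest(bits):
    """Stand-in hashing function: first 8 bytes of SHA-256 of the bit list."""
    return hashlib.sha256(bytes(bits)).digest()[:8]

def correct_by_hash(received, expected_hash, max_flips):
    """Try corrections with 0, 1, ..., max_flips flipped bits until the hash matches."""
    n = len(received)
    for k in range(max_flips + 1):
        for positions in combinations(range(n), k):
            candidate = list(received)
            for i in positions:
                candidate[i] ^= 1
            if digest(candidate) == expected_hash:
                return candidate
    return None  # no correction with at most max_flips flips passes verification

if __name__ == "__main__":
    message = [1, 0, 1, 1, 0, 0, 1, 0] * 3
    received = list(message)
    received[3] ^= 1
    received[17] ^= 1
    print(correct_by_hash(received, digest(message), 3) == message)
```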

It remains to make precise what 'all typical corrections' means. For a
theoretically infinite data stream we should be able to take the $\beta \to 0$ limit
in (46). It can be achieved by intersecting sets of corrections passing verification
for some sequence $\beta_i \to 0$. For a data stream of a finite length, we could take
the proper correction which is nearest to typicality (smallest $\beta$), but we will
see that the best one will be the one that corrects the smallest number of bits.



We will now modify this method to make it practical - instead of making one huge
verification, intuitively we will spread it uniformly over the whole message.
Thanks to this we will be able to detect errors not only at the end of the process,
but also shortly after they appear: after an error, in each step we have some fixed
probability ($p_d$) to detect that something was wrong. We will pay for this
parameter in capacity, but when it exceeds some critical point, the number of
corrections not detected by this mechanism will no longer grow exponentially. So we
will no longer require that the number of possible hash values (states of the coder)
should grow exponentially - the relative cost of storing it will vanish asymptotically.

This threshold corresponds to Shannon's capacity, but in practical (nearly
linear) correction methods there appears an additional problem with large error
concentrations and we should use a bit larger capacity. We will see it in the next
subsection.

The situation looks as in fig. 4: the transmitted codeword (correct path) is
denoted by the thick line. We start with the fixed initial state and try to process
succeeding bits. While we are on the correct path, the state changes in a
random-looking way among all allowed states. After an error (we've lost the path),
the state also changes in a random-looking way, but this time among all states - in
each step there is some probability ($p_d$) that we will get to a forbidden state
(observe that something was wrong).

Figure 4: Schematic picture of the path tracing correction algorithm (labels in the
figure: fixed initial state, errors, wrong corrections, fixed final state). If the
number of states is large enough, corrections to consider should no longer create
cycles as in the figure, and so they create a tree.

The problem is that after an error, before it is detected, the coder can
accidentally get into the correct state - we would go back to the correct path without
a possibility to detect that we've made a wrong correction. We will see later that
with a proper selection of parameters, errors will appear more slowly than we can
correct them - the probability of such a situation will drop asymptotically to zero.

We can use an entropy coder with internal state for such a path tracing purpose: add
a forbidden symbol of probability $p_d$, marking its appearances as forbidden states,
and correspondingly rescale the probabilities of the rest of the symbols (allowed
ones). We could alternatively use an arithmetic coder, but ANS is faster, has useful
modification capabilities and is generally simpler, so I will concentrate on it.

If we want to encode a symbol sequence with $(q_s)_s$ probability distribution, we
have to use correspondingly the $((1 - p_d) q_s)_s$ probability distribution instead.
Now while encoding we use only these allowed symbols. If there were no errors,
while decoding we would also use only allowed symbols, but after an error we would
produce a practically random sequence of symbols, so in each step we have
probability $p_d$ of trying to produce the forbidden symbol and so detecting that
there was an error.

To use ANS for this purpose we would rather need to use some method to
increase the number of internal states of the coder to reduce the probability of a
wrong correction. The initial state can be generally fixed for the encoding process,
but the final one has to be stored, probably in the header of the file. So some
separate strong error correction method has to be used for it, so that we can be
sure that this initial decoder state is proper.

What is the cost of adding such a forbidden symbol? The data sequence contains
on average $H = -\sum_s q_s \lg(q_s)$ bits per symbol. After the rescaling, we will
use on average

$$ H' := -\sum_s q_s \lg((1 - p_d) q_s) = H - \lg(1 - p_d) \text{ bits/symbol.} $$ (47)
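
As a quick numerical illustration of (47), the sketch below rescales an example distribution, appends a forbidden symbol of probability $p_d$ and compares the resulting cost with $H - \lg(1 - p_d)$. The distribution and the function names are made up for the example.

```python
from math import log2

def entropy(q):
    """H = -sum_s q_s lg(q_s) in bits per symbol."""
    return -sum(x * log2(x) for x in q if x > 0)

def with_forbidden_symbol(q, p_d):
    """Rescaled allowed symbols followed by the forbidden symbol of probability p_d."""
    return [(1 - p_d) * x for x in q] + [p_d]

def cost_per_symbol(q, p_d):
    """H' = -sum_s q_s lg((1 - p_d) q_s): bits used per encoded (allowed) symbol."""
    return -sum(x * log2((1 - p_d) * x) for x in q if x > 0)

if __name__ == "__main__":
    q, p_d = [0.5, 0.25, 0.25], 0.25
    print(entropy(q), cost_per_symbol(q, p_d), entropy(q) - log2(1 - p_d))
```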

An intuitive argument will now be shown that choosing $p_d$ as in Shannon's limit:

$$ -\lg(1 - p_d) > H \frac{h(p_b)}{1 - h(p_b)} $$ (48)

the possible space of hash values (states of the coder) would no longer have to grow
exponentially, as it did for $p_d = 0$ at the beginning of this subsection. In this
case the cost of storing this (protected) hash value would vanish asymptotically.
Denote

$$ p_d^0 := 1 - 2^{-H \frac{h(p_b)}{1 - h(p_b)}} = 1 - 2^{-H' h(p_b)} $$ (49)

this threshold value. For simplicity let us assume that $b = 2$.

Unfortunately there is a very subtle problem with this argument, which will be 
explained and precisely analyzed in the next subsection. 

Assume we've received some message of length N bits. We will use the simplest
method for now: as before, try to correct it using 'all typical corrections' and
check whether they pass the verification: while decoding we use only allowed states
and the final state is correct. As in the picture, such a message agrees with the
correct one before the first error. Then it can vary according to the noise until it
reaches a forbidden state or the correct state for the given point.

Let us assume that $j > 0$ steps after an error it still hasn't reached a forbidden
state. If we don't accidentally get to the correct state, the probability of such a
situation is $(1 - p_d)^j$. One step corresponds to on average H encoded bits, which
correspond to on average $H - \lg(1 - p_d)$ transmitted bits, so in j steps we
processed on average $j (H - \lg(1 - p_d))$ bits. They can freely change according to
the noise, so we should check about $2^{j (H - \lg(1 - p_d)) h(p_b)}$ of their
corrections. If we choose $p_d$ such that

$$ 1 > (1 - p_d)^j \, 2^{j (H - \lg(1 - p_d)) h(p_b)} = \left( 2^{\lg(1 - p_d) + (H - \lg(1 - p_d)) h(p_b)} \right)^j $$ (50)

the expected number of such corrections not rejected by this mechanism will no
longer grow exponentially. This threshold is exactly Shannon's limit (48).
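
The threshold can be checked numerically: at $p_d = p_d^0$ from (49) the base of the exponent in (50) equals 1, above it the base drops below 1. The following sketch uses illustrative function names and example values of H and $p_b$.

```python
from math import log2

def binary_entropy(p):
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def threshold_pd0(H, p_b):
    """p_d^0 = 1 - 2^(-H h(p_b) / (1 - h(p_b))), formula (49)."""
    h = binary_entropy(p_b)
    return 1.0 - 2.0 ** (-H * h / (1.0 - h))

def growth_base(H, p_d, p_b):
    """Per-step factor from (50): (1 - p_d) * 2^((H - lg(1 - p_d)) h(p_b))."""
    h = binary_entropy(p_b)
    return (1.0 - p_d) * 2.0 ** ((H - log2(1.0 - p_d)) * h)

if __name__ == "__main__":
    H, p_b = 1.0, 0.01
    p0 = threshold_pd0(H, p_b)
    print(p0, growth_base(H, p0, p_b))
```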
We can now intuitively estimate the probability of wrong correction scenarios
as in the picture - that we start with an error from a correct state at some
point of time and after some typical noise accidentally get back to some correct
state. There are almost N possible starting points for such a scenario. If
$p_d > p_d^0$, the expected number of corrections whose errors won't be detected by
the forbidden states mechanism no longer grows - it usually even drops to zero with
the width of such a subrange (j). So the expected number of such scenarios can be
bounded from above by $N^2$. If the number of states of the coder behaves for example
like $N^3$, almost certainly only the proper correction will pass the verification.

To store a protected one of $N^3$ values we need a bit more than $3 \lg(N)$ bits
- while calculating the channel's capacity this cost vanishes asymptotically.



9.3 Practical correction algorithms 

Methods built on proofs that we can approach Shannon's limit clearly require
exponential correction time, like checking all typical corrections. In this
subsection I'll try to convince the reader that practical (nearly linear) correction
methods have to use a redundancy level above some higher limit, which will be found
here. Then a general approach to correction will be presented - building correction
trees. For the basic choice of weight it requires a bit more redundancy than this
new limit.



9.3.1 Practical correction limit 

Generally if we want to find correction in practically linear time, we rather cannot 
work on corrections of the whole message (exponential number), but should slowly 
elongate them. So practical correction algorithms should use enough redundancy 
to ensure that expected number of corrections to be considered up to given point is 
finite. 

The problem is that in fact we don't know the number of damaged bits, only
that they appear with probability $p_b$ - asymptotically the probability distribution
of the number of damaged bits is Gaussian with $\sqrt{N p_b (1 - p_b)}$ standard
deviation. This uncertainty vanishes asymptotically, but surprisingly it has an
essential influence on the expected number of corrections to be considered at a given
moment (the width of the correction tree) - rarely there are very large local error
concentrations which result in infinite expected width - this essentially influences
(50).



We will see it formally later, but intuitively among corrections which survived
to a given moment, the fewer bits they changed, the more probable they are. It
suggests the simplest correction algorithm - moving the 'front' of the tree -
for a given moment remember some number (M) of corrections which passed
verification with the smallest number of corrected bits. Now in each step try to
expand all of them and take only the best M of those which gave some allowed state.

The question is: how many of them (M) should we remember?

In other words - which position in this order does the proper one have? The larger M
is, the slower the algorithm, but also the larger the probability that the proper
correction is among the considered ones. If at a given moment this number is not
enough, we lose the proper correction. Fortunately we can observe it - from this
moment the number of damages will have to be larger than expected (about
$h^{-1}(-\lg(1 - p_d)/H')$) - it can suggest going back and using a locally larger M.
Later we will do it in a smarter way, but for this moment assume that we always use
a large enough, varying M.

To summarize: we have to ensure that the expected number of corrections which
passed the verification up to a given moment (J bits) and change a smaller number
of bits than the proper correction is finite.
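
The "moving front" idea can be tried on the toy checksum scheme from the bottom of fig. 3. The sketch below is not ANS: it assumes a hypothetical hash-based state update (`next_state`), blocks made of 2 state bits followed by 1 data bit, and a final state delivered reliably to the receiver. Each candidate correction is scored by the number of bits it changes, and only the best M candidates are kept after every block.

```python
import hashlib

def next_state(state, bit):
    """Hypothetical state update: hash the old 64-bit state with the data bit."""
    data = state.to_bytes(8, "big") + bytes([bit])
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def check_bits(state):
    """Two state bits used as the checksum part of a block."""
    return (state >> 1) & 1, state & 1

def encode(bits, state=0):
    """Each block is (checksum bit, checksum bit, data bit); returns stream and final state."""
    out = []
    for d in bits:
        c1, c2 = check_bits(state)
        out += [c1, c2, d]
        state = next_state(state, d)
    return out, state

def decode(received, final_state, M, state=0):
    """Keep the M candidates that change the fewest received bits (the 'front')."""
    front = [(0, state, ())]                      # (flipped bits, state, decoded bits)
    for k in range(0, len(received), 3):
        r1, r2, r3 = received[k:k + 3]
        grown = []
        for flips, s, bits in front:
            c1, c2 = check_bits(s)
            base = flips + (c1 != r1) + (c2 != r2)
            for d in (0, 1):                      # both hypotheses for the data bit
                grown.append((base + (d != r3), next_state(s, d), bits + (d,)))
        grown.sort(key=lambda c: c[0])
        front = grown[:M]
    passing = [c for c in front if c[1] == final_state]
    return list(min(passing, key=lambda c: c[0])[2]) if passing else None
```

If M is at least $2^n$ for n data bits, nothing is ever discarded and the search is exhaustive; a much smaller M is usually enough, but then the proper correction can occasionally fall out of the front, which shows up as `None` (no candidate reaches the expected final state).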

The probability distribution of the number of damaged bits is Gaussian with
center at $p_b J$ and standard deviation $\sqrt{J p_b (1 - p_b)}$. If there were in
fact $pJ$ damaged bits, the number of wrong corrections with at most this number of
damaged bits is asymptotically dominated by
$\binom{J}{pJ} \approx (2\pi J p (1 - p))^{-1/2} \, 2^{J h(p)}$, and about
$(1 - p_d)^{J/H'}$ of them are expected to survive these about $J/H'$ steps.

So the expected number of those which survived is asymptotically approximately

$$ \int_0^1 (1 - p_d)^{J/H'} \cdot \frac{2^{J h(p)}}{\sqrt{2\pi J p (1 - p)}} \cdot \frac{1}{\sqrt{2\pi J p_b (1 - p_b)}} \, e^{-\frac{(pJ - p_b J)^2}{2 J p_b (1 - p_b)}} \, J \, dp $$

It is finite for $J \to \infty$ if for all $p \in [0, 1]$

$$ \frac{\lg(1 - p_d)}{H'} + h(p) - \frac{\lg(e)}{2 p_b (1 - p_b)} (p - p_b)^2 < 0 $$

This function of p has only one maximum - a bit above $p_b$, as expected. This p
corresponds to the cases which influence the expected number of corrections the most.
Finally the integral is finite if we choose $p_d > p_d^1$, where

$$ p_d^1 := 1 - 2^{-H' \max_{p \in [0,1]} \left( h(p) - \frac{\lg(e)}{2 p_b (1 - p_b)} (p - p_b)^2 \right)} \quad (> p_d^0) $$ (51)

Using $H' = H - \lg(1 - p_d)$, we get

$$ \frac{H'}{H} = \frac{1}{1 - \max_{p \in [0,1]} \left( h(p) - \frac{\lg(e)}{2 p_b (1 - p_b)} (p - p_b)^2 \right)} $$ (52)

Comparing to Shannon's limit - it's a bit larger for small $p_b$, up to twice larger
for $p_b \to 0.5$. This formula can be used to choose $p_d = 1 - 2^{H - H'}$.
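
The comparison with Shannon's limit can be reproduced numerically. The sketch below evaluates the right-hand side of (52) by a simple grid search over p (the grid size is an arbitrary choice made here) and compares it with $1/(1 - h(p_b))$; the function names are illustrative.

```python
from math import e, log2

def binary_entropy(p):
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def shannon_expansion(p_b):
    """H'/H at Shannon's limit: 1 / (1 - h(p_b))."""
    return 1.0 / (1.0 - binary_entropy(p_b))

def practical_expansion(p_b, grid=100_000):
    """H'/H from formula (52), with the maximum over p found on a uniform grid."""
    c = log2(e) / (2.0 * p_b * (1.0 - p_b))
    m = max(binary_entropy(k / grid) - c * (k / grid - p_b) ** 2
            for k in range(grid + 1))
    return 1.0 / (1.0 - m)

if __name__ == "__main__":
    for p_b in (0.01, 0.1, 0.3, 0.49):
        print(p_b, shannon_expansion(p_b), practical_expansion(p_b))
```

For small $p_b$ the two values stay close, while for $p_b$ approaching 0.5 their ratio approaches 2, in agreement with the remark above.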

To summarize: if we used $p_d \in [p_d^0, p_d^1]$, while trying to consider some best
corrections up to a given moment, we would asymptotically need an exponential number
of steps. Because we rather use a polynomially large number of states of the coder,
the eventually found correction will probably be wrong. If we can tolerate
exponential correction time, we can use an exponential number of states and a
smaller $p_d$, getting a better limit and finally Shannon's limit for $p_d = 0$, as at
the beginning of the previous subsection.

Figure 5: Comparison of Shannon's limit ($p_d^0$), the limit for practical correction
algorithms ($p_d^1$) and the limit reached by the basic version of the correction tree
algorithm ($p_d^2$).

In practice we use finite blocks, so choosing some large M, with large probability
we will find the proper correction (in linear time). We can verify it: it's most
probably the proper correction if the density of damages didn't drastically grow from
some point. If it did, we could focus on this point and try to correct this
correction. It suggests how to choose M: use a relatively small M first, until the
error density grows much faster than it should. Then locate this point and try to use
a larger M around it until the error density returns to the expected one.

9.3.2 Correction tree algorithms 

The previously suggested algorithm considered some number of best corrections
up to a given point - we were expanding the tree of possible corrections by moving
its whole 'front'. We will now look for more sophisticated algorithms - ones which at
a given moment select some best node to expand. We should get some (pseudo)random
tree with many subtrees of wrong corrections growing from the core made of the
proper correction. These subtrees generally have a larger error concentration.

Each dot in fig. 6 corresponds to some allowed state of the decoder. When we get to
a forbidden state, we have to expand somewhere else. Edges of such a tree correspond
to some corrections of bits used in the given step - intuitively the fewer bits it
corrects, the more probable it is - we should try it earlier.

To create the logical structure of the tree, we have generally three possibilities:

1. make a node for each bit - a split denotes changing one bit - it's rather
impractical, or

2. make a node for each step - a split denotes a correction of the bit block used in
the last step - good for large $p_b$, or

3. make a node every time a forbidden state occurs (as in the figure) - we connect
consecutive bit blocks as long as we can decode further without correction -
good for small $p_b$.

Figure 6: Correction algorithm. It will create such (pseudo)random trees. If
$p_d < p_d^0$ this tree would immediately grow exponentially. If $p_d = p_d^0$ it would
appear to grow linearly. If $p_d^0 < p_d < p_d^2$ and we use the basic weight function,
its width will be generally small, but rare high error concentrations will sum to
infinite expected width. If $p_d > p_d^2$ its expected width will be finite - it can be
used for potentially infinite messages. This limit can probably be improved up to
$p_d^1$.

In each node we have to store somehow the pointer to its father, the last used
correction and the position in the message. To quickly choose the best node at a
given moment, we will also store there some weight and always choose the node with
the largest one. For each node we can easily find its most probable child that was
not considered yet - they are denoted as 'triangles' in the figure. At a given moment
we can focus on them and mark further ones after using some.

To summarize, the main loop of the algorithm is 

• Find the most probable correction not considered yet (one of 'triangles'), 

• Try to expand it - decode one step or until we get to a forbidden state, 

• Modify the tree - create new node and at most two 'triangles' - the first for 
the new node and the next one to the 'triangle' used. 

until we get to the fixed final state on the end of the message. 
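
A simplified variant of this loop can be sketched with a priority queue. It is not the exact tree-with-'triangles' structure described above, but a plain best-first search (in the spirit of stack sequential decoding) on a toy code that is assumed here for illustration: blocks of 2 checksum bits from a hypothetical hash-based state plus 1 data bit, so a random state passes a block with probability 1/4 and $p_d = 3/4$. Node weights are accumulated with formula (55) from the next subsection, and the final state is assumed to be known to the receiver.

```python
import hashlib
import heapq
from math import log2

def next_state(state, bit):
    """Hypothetical state update: hash the old 64-bit state with the data bit."""
    data = state.to_bytes(8, "big") + bytes([bit])
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def check_bits(state):
    return (state >> 1) & 1, state & 1

def encode(bits, state=0):
    out = []
    for d in bits:
        c1, c2 = check_bits(state)
        out += [c1, c2, d]
        state = next_state(state, d)
    return out, state

def correct_best_first(received, final_state, p_b, p_d=0.75, state=0):
    """Repeatedly expand the node with the largest weight until the final state is reached."""
    n_blocks = len(received) // 3
    lg_err, lg_ok, bonus = log2(p_b), log2(1 - p_b), -log2(1 - p_d)
    heap = [(0.0, 0, 0, state, ())]   # (-weight, tie-breaker, blocks done, state, bits)
    counter = 1
    while heap:
        neg_w, _, k, s, bits = heapq.heappop(heap)
        if k == n_blocks:
            if s == final_state:
                return list(bits)     # passed the final verification
            continue                  # complete but wrong path: expand somewhere else
        r1, r2, r3 = received[3 * k:3 * k + 3]
        c1, c2 = check_bits(s)
        for d in (0, 1):
            flips = (c1 != r1) + (c2 != r2) + (d != r3)
            w = -neg_w + flips * lg_err + (3 - flips) * lg_ok + bonus
            heapq.heappush(heap, (-w, counter, k + 1, next_state(s, d), bits + (d,)))
            counter += 1
    return None
```

On an undamaged stream the search walks straight along the correct path. After a local concentration of errors the weight of the correct path drops, and the queue first explores some subtrees of wrong corrections before returning to it.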

9.3.3 The weights of nodes of the correction tree

We will work on two 'time scales' - j will denote the number of states of the decoder,
which corresponds to on average $J \approx H' j$ bits of the encoded message.

In each step we have the standard situation for error correction: we make some
observation (O) and we need to evaluate the probabilities of its possible
explanations (E). To cope with such problems we use Bayesian analysis, which says
that the probability of a given explanation is the probability that this explanation
causes the given symptoms, multiplied by the probability of this explanation and
normalized by the sum over all possible explanations:

$$ \Pr(E|O) = \frac{\Pr(O|E) \Pr(E)}{\Pr(O)} = \frac{\Pr(O|E) \Pr(E)}{\sum_{E'} \Pr(O|E') \Pr(E')} $$ (53)

In our case we observe some tree (O) and we want to find the error distribution (E)
which caused it - the most probable node to expand at this moment. We will treat
E as in (45): it's a {0, 1} vector in which '1' on a given position denotes that we
should change the given bit. Finding $\Pr(E)$ is easy:

$$ \Pr(E) = p_b^{\#\{1 \le i \le J : E_i = 1\}} (1 - p_b)^{\#\{1 \le i \le J : E_i = 0\}} $$ (54)

or correspondingly some more complicated function if we don't assume a symmetric,
memoryless channel.

The problem is to calculate Pr{0\E). We should use the real structure of the tree 
to find it, but it would require assuming algorithm which created it - it's becoming 
extremely complicated. 

We will focus now on some basic method: using only the number of
allowed/forbidden states outside E. We will see that it already gives an algorithm
very close to the found theoretical limit for practical correction algorithms. Some
ways to improve it will also be shown later.

If we assume that a given correction (E) is proper, the states outside the
corresponding path in our tree should fulfill the statistics: $p_d$ of them are
forbidden, $1 - p_d$ are allowed:

$$ \Pr(O|E) = p_d^{\#\text{forbidden}} (1 - p_d)^{\#\text{allowed}} $$

Finally we can assign to each 'triangle' $\Pr(O|E) \Pr(E)$ - it's some product of
powers of $p_b$, $1 - p_b$, $p_d$, $1 - p_d$. The first pair corresponds to the given
correction, the second to the number of forbidden/allowed states in the rest of the
tree.

Observe that the number of forbidden/allowed states outside some path is the
number of all forbidden/allowed states in the tree minus those in the considered path
(only allowed ones). So dividing $\Pr(O|E) \Pr(E)$ by the term for the whole tree, we
see that we have to maximize

$$ p_b^{\#\{i : E_i = 1\}} \, (1 - p_b)^{\#\{i : E_i = 0\}} \, (1 - p_d)^{-\#\text{states in this correction}} $$

among all corrections (E) worth considering at this moment ('triangles'). We want
to find only the node with maximal weight, so we can work on the logarithm of this
value. Finally while building the tree, to calculate the weight for a given step of
the decoder, we should add

$$ \#\{i : E_i = 1\} \cdot \lg(p_b) + \#\{i : E_i = 0\} \cdot \lg(1 - p_b) - \lg(1 - p_d) $$ (55)

to the weight of the previous step, where this time E denotes the bits used in the
last decoding step.

We also see that when using the previous algorithm - moving the 'front' of the tree -
the most probable nodes really are those with the smallest number of damaged bits.
The term with $\lg(1 - p_d)$ allows us to handle corrections having different lengths -
additionally emphasizing longer corrections.

Formally, because of the search for the node with the largest weight, this
algorithm has $N \lg(N)$ time complexity. In practice we usually need to work on a
relatively small number of nodes with the largest weight - the required priority
queue could have some fixed size. Very rarely it will run out and we will have to
look through the 'triangles' stored outside it.



9.3.4 Analysis of correction tree algorithm with basic weights 

Let us assume that we use this algorithm to correct some message: there is
some unknown vector of errors (E) with probability $p_b$ of '1'.

Observe that asymptotically the average weight per node (55) is

$$ H' p_b \lg(p_b) + H' (1 - p_b) \lg(1 - p_b) - \lg(1 - p_d) = -H' h(p_b) - \lg(1 - p_d) $$

so the condition $p_d > p_d^0$ is equivalent to the weight of the correct path
statistically growing. But the error distribution is not uniform - the weight of the
correct path can locally decrease. In such a situation, before we continue expanding
the correct path, we have to expand some subtrees of wrong corrections.

The problem is that when the local concentration of errors is very large, the weight
of the correct path drops dramatically and so we have to expand very large subtrees
of wrong corrections. The probability of such scenarios decreases exponentially with
the size of such a weight drop (w), but the size of such subtrees grows exponentially
with it.

Such a weight drop can have generally any length: observe that to make the whole
correction, from position J we have to expand subtrees of wrong corrections up to
weight about:

$$ \min_{J'} \left( \#\{i \in [J, J + J') : E_i = 1\} \lg(p_b) + \#\{i \in [J, J + J') : E_i = 0\} \lg(1 - p_b) - \frac{J'}{H'} \lg(1 - p_d) \right) $$

We need to find the expected probability distribution of such drops. It doesn't
depend on the position:

V(w) := probability that the weight will drop by at most w



9 NEAR SHANNON'S LIMIT ERROR CORRECTION METHOD 



43 



For $w < 0$, $V(w) = 0$, but $V(0)$ corresponds to the situation in which the weight never drops below its current value, so it should be positive.

Before going to the general case, let us focus for a moment on a simplified one, in which all blocks have exactly 1 bit ($H' = 1$). Observe that we can write an equation for $V$ relating two succeeding positions:



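$$V(w) = \begin{cases} p_b\, V\big(w + \lg(p_b) - \lg(\tilde{p}_d)\big) + \tilde{p}_b\, V\big(w + \lg(\tilde{p}_b) - \lg(\tilde{p}_d)\big) & \text{for } w \geq 0 \\ 0 & \text{for } w < 0 \end{cases} \tag{56}$$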

If there were no such boundary of behaviors at $w = 0$, it would be a simple functional equation, with some combination of exponentials as its solution. Fortunately, we can still use it to find the asymptotic behavior.

We know that $\lim_{w\to\infty} V(w) = 1$, so let us assume that asymptotically

$$1 - V(w) \propto 2^{vw} \tag{57}$$

for some $v < 0$. Substituting it into (56), we get:

$$2^{vw} = p_b\, 2^{v(w + \lg(p_b) - \lg(\tilde{p}_d))} + \tilde{p}_b\, 2^{v(w + \lg(\tilde{p}_b) - \lg(\tilde{p}_d))}$$

that is, $\tilde{p}_d^{\,v} = p_b^{1+v} + \tilde{p}_b^{1+v}$. This equation always has the solution $v = 0$, but for $p_d > p_d^0$ a second, negative solution emerges, which is the one we are interested in. It can be easily found numerically, and simulations show that we asymptotically reach this behavior.
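A minimal sketch of finding the negative solution numerically is given below, for one-bit blocks and assuming the equation reduces to $\tilde{p}_d^{\,v} = p_b^{1+v} + \tilde{p}_b^{1+v}$ with $\tilde{p} = 1 - p$. The function names and the bisection bounds are illustrative choices, not taken from the original.

```python
from math import log2

def log_residual(v, p_b, p_d):
    """lg(p_b^(1+v) + (1-p_b)^(1+v)) - v*lg(1-p_d); zero at a solution.
    Computed in a log-sum-exp style to avoid overflow for very negative v."""
    a = (1 + v) * log2(p_b)
    b = (1 + v) * log2(1 - p_b)
    m = max(a, b)
    return m + log2(2 ** (a - m) + 2 ** (b - m)) - v * log2(1 - p_d)

def negative_root(p_b, p_d, tol=1e-12):
    """Bisection for the v < 0 solution; returns None if none is found."""
    hi = -1e-6
    if log_residual(hi, p_b, p_d) >= 0:
        return None          # residual not negative just below 0: no second root
    lo = -1.0
    while log_residual(lo, p_b, p_d) <= 0:
        lo *= 2              # widen the bracket until the residual turns positive
        if lo < -1e4:
            return None
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if log_residual(mid, p_b, p_d) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

The residual is a convex function of $v$ that vanishes at $v = 0$, so under these assumptions there is at most one further root and a simple bracket-and-bisect search is enough.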

We can now go to the general case, in which we use some probability distribution of block lengths:

$P_a$ := probability that a decoding step will use $a$ bits

We have $\sum_a P_a = 1$ and $H' = \sum_a a P_a$.



Writing (56) analogously, this time for a block of length $a$ we would get $2^a$ terms. After the substitution (57), we can collapse them:

$$\tilde{p}_d^{\,v} = \sum_a P_a \left(p_b^{1+v} + \tilde{p}_b^{1+v}\right)^a \approx \left(p_b^{1+v} + \tilde{p}_b^{1+v}\right)^{H'} \tag{58}$$

Again we are interested in the $v < 0$ solution. The approximation on the right holds exactly if blocks have constant length as before, but we should also be able to use it when only two block lengths differing by 1 are used. Generally we should be careful about it.
Now we have to estimate the asymptotic behavior of the size of subtrees of wrong corrections for large weight drops. Each node of such a subtree can be thought of as the root of a new subtree. If we expand it for a given allowed weight drop, it should give a fixed expected number of nodes:

$U(t)$ := expected number of processed nodes for at most $t$ weight drop

Again, $U(t) = 0$ for $t < 0$. For $t = 0$ we process this node: $U(0) \geq 1$.

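Let's focus on one-bit blocks ($H' = 1$) as previously. Connecting a node with its children, we get:

$$U(w) = \begin{cases} 1 + \tilde{p}_d \Big( U\big(w + \lg(p_b) - \lg(\tilde{p}_d)\big) + U\big(w + \lg(\tilde{p}_b) - \lg(\tilde{p}_d)\big) \Big) & \text{for } w \geq 0 \\ 0 & \text{for } w < 0 \end{cases} \tag{60}$$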



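This time we would expect that for some $u > 0$ asymptotically

$$U(w) \propto 2^{uw} \tag{61}$$

Substituting it into (60) as previously (the constant 1 is negligible asymptotically), we get

$$\tilde{p}_d^{\,u-1} = p_b^{u} + \tilde{p}_b^{u} \tag{62}$$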



This equation is very similar to (58); for $p_d > p_d^0$ we again get two solutions. As previously, because of the strong boundary conditions at $w = 0$, we asymptotically reach the smaller one, which numerical simulations confirm. Comparing these two equations, we surprisingly get a simple correspondence between the two critical coefficients:

$$u = v + 1 \tag{63}$$

which is also fulfilled in the general case with blocks of different lengths.

Having the $U(w)$ and $V(w)$ functions, we can finally find the expected number of nodes in subtrees of wrong corrections per one node of the proper correction:

$$\int_0^\infty \Big[ p_b\, U\big(w - \lg(p_b) + \lg(\tilde{p}_b)\big) + \tilde{p}_b\, U\big(w + \lg(p_b) - \lg(\tilde{p}_b)\big) \Big]\, dV(w) \tag{64}$$







Because $V$ is not continuous, it is formally a Stieltjes integral, but to estimate the asymptotic behavior we can use

$$dV(w) = \frac{dV(w)}{dw}\,dw \propto 2^{vw}\,dw$$

So the expected number of processed nodes per corrected bit is finite if and only if $u + v < 0$, which by (63) means

$$v < -\frac{1}{2}$$
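The critical $p_d$ is given by

$$\tilde{p}_d^{-1/2} = p_b^{1/2} + \tilde{p}_b^{1/2} \quad \text{or generally:} \quad \tilde{p}_d^{-1/2} = \sum_a P_a \left(\sqrt{p_b} + \sqrt{\tilde{p}_b}\right)^a \approx \left(\sqrt{p_b} + \sqrt{\tilde{p}_b}\right)^{H'} \tag{65}$$

Let us denote the value of $p_d$ solving this equation as $p_d^2$ (the superscript is an index, not a power).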



Generally $P_a$ depends on $p_d$, so we should solve this equation numerically. We will use the approximation on the right to estimate the critical channel capacity for this boundary:

$$H' = H - \lg(\tilde{p}_d^2) \approx H + 2H' \lg\left(\sqrt{p_b} + \sqrt{\tilde{p}_b}\right)$$

$$\frac{H'}{H} \approx \frac{1}{1 - 2\lg\left(\sqrt{p_b} + \sqrt{\tilde{p}_b}\right)}$$

It is at most 13.1% larger (for $p_b \approx 0.03$) than for the limit for practical correction algorithms (fig. [s]). Using $p_d > p_d^2$ we can be sure that the expected width of the tree is finite: using a polynomially large number of states of the decoder, asymptotically almost certainly this algorithm will give the proper correction in practically linear time.
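The critical detection probability and the corresponding ratio $H'/H$ can be evaluated directly. The sketch below assumes constant block length, so that the approximation on the right of (65) is exact, and again takes $\tilde{p} = 1 - p$; the function names are hypothetical.

```python
from math import log2, sqrt

def critical_p_d(p_b, bits_per_step=1.0):
    """p_d solving (1 - p_d)^(-1/2) = (sqrt(p_b) + sqrt(1 - p_b))^H'."""
    s = sqrt(p_b) + sqrt(1 - p_b)
    return 1 - s ** (-2 * bits_per_step)

def bits_ratio_finite_width(p_b):
    """H'/H = 1 / (1 - 2 lg(sqrt(p_b) + sqrt(1 - p_b)))."""
    return 1 / (1 - 2 * log2(sqrt(p_b) + sqrt(1 - p_b)))
```

For $p_b = 0.1$ and one-bit blocks we have $(\sqrt{0.1} + \sqrt{0.9})^2 = 1.6$, so `critical_p_d(0.1)` is $1 - 1/1.6 = 0.375$ up to rounding.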



This algorithm needs slightly more minimal redundancy than the 'moving the front' of the tree approach, because it sometimes builds huge subtrees of wrong corrections, which go much further than the proper node. This is caused by the $-\lg(\tilde{p}_d)$ term in the weight function: it amortizes large error density by the length. We see that we could improve the correction tree algorithm by sometimes switching to the 'moving the front' algorithm, that is, by sometimes enforcing expansion of shorter paths.

For example: if the width of the tree or the local error density exceeds some value, make some number of steps using only 'triangles' having position below some boundary, such as the position of the largest width. Unfortunately I couldn't find optimal parameters analytically, but they can be found experimentally.



9.4 Generalized block codes 

In the previous two subsections we were using two correction mechanisms: that the final hash value (state of the coder) has to agree, and that after an error, in each step with probability $p_d$ we will see that something was wrong. In this subsection we will see how to use the huge freedom of choice of ANS coding tables to include additional error correction mechanisms known from standard block codes, as in fig. 3. This mechanism makes it possible to immediately repair simple damages, and so reduces the number of uses of the decoding table, speeding up the process. Finally, it can be imagined as block codes in which blocks are no longer independent, but have connected redundancy. This connection is made by the internal state of the coder.

This time $p_d$ is rather large (at least 1/2), so we should use larger $H$, for example by treating a few bits as a symbol to be encoded in one step.

The idea is to ensure that the Hamming distances between different allowed states are at least some fixed value. For distance 2 this can be easily done by inserting an additional parity check bit, for example as the one just before the oldest bit (which is always 1). In this case we can just use ScD initialization for the original symbol probability distribution and then insert the parity bit, use $2l$ instead of $l$ and mark the rest (half) of the states as forbidden. In this case $p_d = 1/2$.
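The parity-bit construction can be sketched as follows. It assumes states are integers in $[l, 2l)$ with $l = 2^L$, and it interprets "before the oldest bit" as the position directly below the leading 1, so that the oldest bit of the extended state is still 1. The function names and the small example with $L = 3$ are illustrative only.

```python
def insert_parity_bit(state, L):
    """Map a state in [2^L, 2^(L+1)) to a state in [2^(L+1), 2^(L+2))
    by inserting an even-parity bit just below the leading 1."""
    low = state & ((1 << L) - 1)                # the L youngest bits
    parity = (bin(low).count("1") + 1) & 1      # +1 accounts for the leading 1
    return (1 << (L + 1)) | (parity << L) | low

def hamming_distance(x, y):
    return bin(x ^ y).count("1")

L = 3
l = 1 << L
allowed = sorted(insert_parity_bit(x, L) for x in range(l, 2 * l))
forbidden_fraction = 1 - len(allowed) / (2 * l)   # share of states in [2l, 4l) left unused
min_dist = min(hamming_distance(a, b) for a in allowed for b in allowed if a != b)
```

Every allowed state here has an even number of ones, so flipping any single bit always lands on a forbidden state.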

The advantage of such a distance 2 code is that if there is only a single error among the bits of one block, it is detected immediately. So a forbidden state means that either one of the bits used in the last step was damaged or there were at least two errors in some previous block: the set of possible corrections is smaller than previously.

We could also enforce a larger Hamming distance. Observe that e.g. triple modular redundancy codes can be imagined as obtained this way: $l = 2^3$, $b = 2$, the allowed states are '1000' and '1111', and both symbols (0 and 1) have exactly one appearance. The rest of the states are forbidden. During a decoding step, before the bit transfer, the state is always 1, so this example is degenerate: blocks are independent.

Generally, let us take $K$ as the maximum of $k_s$ over all allowed symbols (after rescaling), so that the bit transfer uses at most $K$ bits. We should enforce that for each allowed state, all states differing in at most a given number of bits among these youngest $K$ positions are forbidden. In other words, if $l = 2^L$, for each value of the oldest $L + 1 - K$ bits we should ensure that any two allowed states have Hamming distance of at least the given value, which creates some block code on the $K$ youngest bits.

To make the connection of redundancy work, many essentially different block codes have to be used. We can generate them from a single one: by XORing with some $K$-bit mask as in fig. 3 and by permuting these bits. These operations preserve the minimal Hamming distance of the codewords.

So finally, to mark the allowed states, for each value of the oldest $L + 1 - K$ bits we should choose some $K$-bit mask to XOR with and possibly some permutation of the $K$ bits of the original block code. Then we can distribute the allowed symbols among them. To make these choices, especially when we want to perform encryption simultaneously, we can use a PRNG as before (initialized with the key).
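The way a single block code can be turned into many variants with the same minimal distance may be sketched as below. The base code, the function names, the number of prefixes and the use of Python's `random.Random` as the keyed PRNG are all assumptions made for illustration; in particular, this generator is not cryptographically secure and only stands in for whatever PRNG is actually used.

```python
import random
from itertools import combinations

def min_distance(code):
    """Smallest Hamming distance between two distinct codewords."""
    return min(bin(a ^ b).count("1") for a, b in combinations(code, 2))

def permute_bits(word, perm):
    """Move bit number src of word to position perm[src]."""
    out = 0
    for src, dst in enumerate(perm):
        out |= ((word >> src) & 1) << dst
    return out

def derived_codes(base_code, K, num_prefixes, key):
    """One permuted and masked copy of base_code per value of the oldest bits."""
    rng = random.Random(key)            # keyed PRNG (illustrative choice)
    codes = []
    for _ in range(num_prefixes):
        mask = rng.getrandbits(K)
        perm = list(range(K))
        rng.shuffle(perm)
        codes.append(sorted(permute_bits(c, perm) ^ mask for c in base_code))
    return codes

# Hypothetical 5-bit base code with minimal Hamming distance 3.
BASE_CODE = [0b00000, 0b00111, 0b11100, 0b11011]
codes = derived_codes(BASE_CODE, K=5, num_prefixes=4, key=12345)
```

A bit permutation only relabels positions and an XOR mask flips the same positions in every codeword, so the distance between any two codewords is unchanged; the same key always reproduces the same family of codes.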